Product Reviews Clustering

Author

Yuzhou Fu

1. Project Introduction

In the fast-moving consumer goods industry, customer feedback is crucial for maintaining product quality and brand reputation. With thousands of reviews submitted daily on platforms like Sephora, manually analyzing this unstructured text data is impossible.

The motivation and goal for this analysis is to apply unsupervised learning techniques to automatically uncover distinct themes within customer reviews, transforming raw text information into actionable business insights.

The intended audience for this analysis is the product development or other related teams at a beauty company. By identifying specific clusters of customer feedback, the team can take targeted next steps: for instance, if a cluster reveals consistent complaints about “leaking bottles”, the team can redesign the packaging; if another cluster highlights “skin irritation”, the formula can be revisited for safety testing.

The dataset we will be using is Sephora Products and Skincare Reviews from Kaggle, collected via a Python scraper in March 2023. It contains:

  • Product Dataset: information about all beauty products (over 8,000) from the Sephora online store, including product and brand names, prices, ingredients, ratings, and other features.

  • Customer Review Dataset: user reviews (about 1 million, on over 2,000 products) of all products from the Skincare category, including user appearance attributes and feedback on reviews from other users.

Given the limited computational resources available and the practical aims of this project, we adopted a two-stage data approach. The complete dataset was used for data cleaning and EDA to identify global trends. For the computationally intensive clustering algorithms, we used a stratified sample of approximately 10,000 reviews to ensure efficient model training and parameter tuning.
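The stratified-sampling step mentioned above can be sketched as follows. This is a minimal illustration rather than the project's actual sampling code; the stratification column (here a toy rating column) and the sample size are assumptions:

```python
import pandas as pd

def stratified_sample(df, strata_col, n_total, seed=42):
    """Draw roughly n_total rows, sampling each stratum in proportion to its size."""
    frac = n_total / len(df)
    return df.groupby(strata_col).sample(frac=frac, random_state=seed)

# toy demonstration: 100 reviews spread evenly over five rating strata
toy = pd.DataFrame({
    "review_text": [f"review {i}" for i in range(100)],
    "rating": [1 + i % 5 for i in range(100)],
})
sample = stratified_sample(toy, "rating", n_total=20)
```

With five equally sized strata and n_total=20, each rating level contributes four reviews, preserving the rating distribution of the full data.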

Code
# import relevant libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


from collections import defaultdict
from wordcloud import STOPWORDS
import string

import ast
import re

import operator
import unicodedata
from collections import Counter

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from umap import UMAP
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score
import hdbscan
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.manifold import trustworthiness
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score, silhouette_samples

import warnings
warnings.filterwarnings("ignore", message="MiniBatchKMeans is known to have a memory leak on Windows")

from wordcloud import WordCloud
import math

plt.style.use("seaborn-v0_8-white")

import plotly.io as pio
pio.renderers.default = 'notebook'

2. Data Loading

Load product dataset.

Code
product_df = pd.read_csv("dataset/Sephora/product_info.csv")
product_df.head()
product_id product_name brand_id brand_name loves_count rating reviews size variation_type variation_value ... online_only out_of_stock sephora_exclusive highlights primary_category secondary_category tertiary_category child_count child_max_price child_min_price
0 P473671 Fragrance Discovery Set 6342 19-69 6320 3.6364 11.0 NaN NaN NaN ... 1 0 0 ['Unisex/ Genderless Scent', 'Warm &Spicy Scen... Fragrance Value & Gift Sets Perfume Gift Sets 0 NaN NaN
1 P473668 La Habana Eau de Parfum 6342 19-69 3827 4.1538 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 85.0 30.0
2 P473662 Rainbow Bar Eau de Parfum 6342 19-69 3253 4.2500 16.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0
3 P473660 Kasbah Eau de Parfum 6342 19-69 3018 4.4762 21.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0
4 P473658 Purple Haze Eau de Parfum 6342 19-69 2691 3.2308 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0

5 rows × 27 columns

2.1 Column Type Inconsistency (Review Dataset)

Load customer review dataset.

Code
review_df_1 = pd.read_csv("dataset/Sephora/reviews_0-250.csv", index_col=0)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_27984\3989447197.py:1: DtypeWarning:

Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.

Loading the first partial customer review dataset raises a warning: some columns contain inconsistent data types. We need to check and correct this issue in each partial review dataset.

The problematic columns are author_id and the user-profile columns. Some author_id values are stored as strings instead of integers; each newly encountered string author_id is replaced with an integer, starting from 1. User-profile columns such as skin_tone and eye_color contain both NA values and strings; all NA values in these columns are replaced with ‘No_profile’. Likewise, NA values in review_text and review_title are replaced with ‘No_review’ and ‘No_review_title’, respectively.

Code
# check data types in each column
def column_type_check(df): # data frame
    print('Columns that have multiple data types: ')

    for column in df.columns:
        if len(df[column].apply(type).value_counts()) >= 2:
            print(' ', column)

2.1.1 First Partial Review Dataset

Code
review_df_1 = pd.read_csv("dataset/Sephora/reviews_0-250.csv", index_col=0)
column_type_check(review_df_1)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_27984\467310262.py:1: DtypeWarning:

Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
Columns that have multiple data types: 
  author_id
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
#author_id
mask = (review_df_1.loc[:,'author_id'] == 'dummyUser')
review_df_1.loc[mask,'author_id'] = 1

mask = review_df_1.iloc[:,0].apply(lambda x: isinstance(x, str)) 
a = review_df_1[mask]
mask_idx = a[a.loc[:,'author_id'].str.contains(r'order', regex=False)].index # for the str 'order...'

unique_id = review_df_1.loc[mask_idx, 'author_id'].unique()

# create id mapping for storing id generated 
author_id_mapping = {old: new_id for new_id, old in enumerate(unique_id, start= 2)}

review_df_1.loc[mask_idx, 'author_id'] = review_df_1.loc[mask_idx, 'author_id'].map(author_id_mapping).astype(int)

review_df_1['author_id'] = review_df_1['author_id'].astype(int)

#review_text
mask = review_df_1.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_1.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_1.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_1.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_1.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_1.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'hair_color'] = 'No_profile'
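The six per-column replacements above are repeated verbatim for every partial dataset below; a small helper like the following sketch (a hypothetical refactor, not code used in this notebook) could express the same logic once:

```python
import pandas as pd

def fill_float_entries(df, fill_map):
    """Replace float entries (i.e. NaN) in each column with its placeholder string."""
    for col, placeholder in fill_map.items():
        mask = df[col].apply(lambda x: isinstance(x, float))
        df.loc[mask, col] = placeholder
    return df

FILL_MAP = {
    "review_text": "No_review",
    "review_title": "No_review_title",
    "skin_tone": "No_profile",
    "eye_color": "No_profile",
    "skin_type": "No_profile",
    "hair_color": "No_profile",
}

# toy demonstration with one NaN per column
toy = pd.DataFrame({
    "review_text": ["great", float("nan")],
    "review_title": [float("nan"), "ok"],
    "skin_tone": ["fair", float("nan")],
    "eye_color": [float("nan"), "brown"],
    "skin_type": ["dry", float("nan")],
    "hair_color": [float("nan"), "black"],
})
toy = fill_float_entries(toy, FILL_MAP)
```

Each partial dataset could then be processed with a single call, e.g. fill_float_entries(review_df_2, FILL_MAP).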

2.1.2 Second Partial Review Dataset

Code
review_df_2 = pd.read_csv("dataset/Sephora/reviews_250-500.csv", index_col=0)
column_type_check(review_df_2)
Columns that have multiple data types: 
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique()
array(['orderGen1254820', 'orderGen1221842', 'orderGen1698648',
       'orderGen53499', 'orderGen51156', 'orderGen333757',
       'orderGen5563740'], dtype=object)
Code
list(author_id_mapping.keys())
['orderGen51156',
 'orderGen2124216',
 'orderGen703225',
 'orderGen5563740',
 'orderGen270100',
 'orderGen1221842',
 'orderGen1254820',
 'orderGen1253445',
 'orderGen1937304',
 'orderGen3046665',
 'orderGen1711826',
 'orderGen309293',
 'orderGen1698648',
 'orderGen39837',
 'orderGen899861']
Code
# author_id
print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique()
      , '\n')

check_idx = np.isin(review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

to_be_added = review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique()[~check_idx]

author_id_mapping_conca = {old: new_id for new_id, old in enumerate(to_be_added, start= 17)}

author_id_mapping = author_id_mapping | author_id_mapping_conca

review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'] = (
    review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].map(author_id_mapping).astype(int)
)

review_df_2['author_id'] = review_df_2['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861'] 

unique string id in this partial dataframe:  ['orderGen1254820' 'orderGen1221842' 'orderGen1698648' 'orderGen53499'
 'orderGen51156' 'orderGen333757' 'orderGen5563740'] 

author_id is in the author id mapping:  [ True  True  True False  True False  True]
Code
#review_text
mask = review_df_2.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_2.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_2.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_2.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_2.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_2.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'hair_color'] = 'No_profile'

2.1.3 Third Partial Review Dataset

Code
review_df_3 = pd.read_csv("dataset/Sephora/reviews_500-750.csv", index_col=0)
column_type_check(review_df_3)
Columns that have multiple data types: 
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
# author_id
print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].unique()
      , '\n')

check_idx = np.isin(review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

to_be_added = review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].unique()[~check_idx]

author_id_mapping_conca = {old: new_id for new_id, old in enumerate(to_be_added, start= 19)}

author_id_mapping = author_id_mapping | author_id_mapping_conca

review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'] = (
    review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].map(author_id_mapping).astype(int)
)

review_df_3['author_id'] = review_df_3['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861', 'orderGen53499', 'orderGen333757'] 

unique string id in this partial dataframe:  ['orderGen5563740' 'orderGen1474435' 'orderGen1698648'] 

author_id is in the author id mapping:  [ True False  True]
Code
#review_text
mask = review_df_3.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_3.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_3.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_3.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_3.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_3.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'hair_color'] = 'No_profile'

2.1.4 Fourth Partial Review Dataset

Code
review_df_4 = pd.read_csv("dataset/Sephora/reviews_750-1250.csv", index_col=0)
column_type_check(review_df_4)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_27984\136717371.py:1: DtypeWarning:

Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
Columns that have multiple data types: 
  author_id
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
# author_id
mask = review_df_4['author_id'].apply(lambda x: isinstance(x, str))
a = review_df_4.loc[mask, 'author_id']

print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', a[~ a.str.isnumeric()].unique()
      , '\n')

check_idx = np.isin(a[~a.str.isnumeric()].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

to_be_added = a[~a.str.isnumeric()].unique()[~check_idx]

author_id_mapping_conca = {old: new_id for new_id, old in enumerate(to_be_added, start= 20)}

author_id_mapping = author_id_mapping | author_id_mapping_conca


mask_idx = a[a.str.contains(r'order', regex=False)].index # for the str 'order...'

review_df_4.loc[mask_idx, 'author_id'] = review_df_4.loc[mask_idx, 'author_id'].map(author_id_mapping).astype(int)

review_df_4['author_id'] = review_df_4['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861', 'orderGen53499', 'orderGen333757', 'orderGen1474435'] 

unique string id in this partial dataframe:  ['orderGen1698648' 'orderGen1566769' 'orderGen1221842' 'orderGen2124216'
 'orderGen3046665' 'orderGen51156' 'orderGen1947347'] 

author_id is in the author id mapping:  [ True False  True  True  True  True False]
Code
#review_text
mask = review_df_4.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_4.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_4.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_4.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_4.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_4.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'hair_color'] = 'No_profile'

2.1.5 Fifth Partial Review Dataset

Code
review_df_5 = pd.read_csv("dataset/Sephora/reviews_1250-end.csv", index_col=0)
column_type_check(review_df_5)
Columns that have multiple data types: 
  author_id
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_27984\1811167476.py:1: DtypeWarning:

Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
Code
# author_id
mask = review_df_5['author_id'].apply(lambda x: isinstance(x, str))
a = review_df_5.loc[mask, 'author_id']

print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', a[~ a.str.isnumeric()].unique()
      , '\n')

check_idx = np.isin(a[~a.str.isnumeric()].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

mask_idx = a[a.str.contains(r'order', regex=False)].index # for the str 'order...'

review_df_5.loc[mask_idx, 'author_id'] = review_df_5.loc[mask_idx, 'author_id'].map(author_id_mapping).astype(int)

review_df_5['author_id'] = review_df_5['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861', 'orderGen53499', 'orderGen333757', 'orderGen1474435', 'orderGen1566769', 'orderGen1947347'] 

unique string id in this partial dataframe:  ['orderGen1947347' 'orderGen1698648' 'orderGen3046665'] 

author_id is in the author id mapping:  [ True  True  True]
Code
#review_text
mask = review_df_5.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_5.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_5.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_5.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_5.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_5.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'hair_color'] = 'No_profile'

2.1.6 Combine All Datasets

After all the corrections, we combine all our partial review datasets.

Additionally, we assign each distinct review text a unique ID.

Code
df_lis = [review_df_1, review_df_2, review_df_3, review_df_4, review_df_5]

review_df_all = pd.concat(df_lis, ignore_index=True)

# assign each review a unique ID
review_df_all['review_id'] = pd.factorize(review_df_all['review_text'])[0]

review_df_all.head()
author_id rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count submission_time review_text review_title skin_tone eye_color skin_type hair_color product_id product_name brand_name price_usd review_id
0 1741593524 5 1.0 1.0 2 0 2 2023-02-01 I use this with the Nudestix “Citrus Clean Bal... Taught me how to double cleanse! No_profile brown dry black P504322 Gentle Hydra-Gel Face Cleanser NUDESTIX 19.0 0
1 31423088263 1 0.0 NaN 0 0 0 2023-03-21 I bought this lip mask after reading the revie... Disappointed No_profile No_profile No_profile No_profile P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 1
2 5061282401 5 1.0 NaN 0 0 0 2023-03-21 My review title says it all! I get so excited ... New Favorite Routine light brown dry blonde P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 2
3 6083038851 5 1.0 NaN 0 0 0 2023-03-20 I’ve always loved this formula for a long time... Can't go wrong with any of them No_profile brown combination black P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 3
4 47056667835 5 1.0 NaN 0 0 0 2023-03-20 If you have dry cracked lips, this is a must h... A must have !!! light hazel combination No_profile P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 4
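pd.factorize assigns the same integer code to identical review texts, in order of first appearance, which is what lets us detect repeated reviews in the next step. A minimal illustration:

```python
import pandas as pd

texts = pd.Series(["Love it!", "Too greasy.", "Love it!", "Meh."])
codes, uniques = pd.factorize(texts)

# identical texts receive the same code, numbered by first appearance
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['Love it!', 'Too greasy.', 'Meh.']
```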

2.2 Review Duplicate

After assigning a unique text ID, we notice that a single product might have multiple repeated reviews. Below are some examples:

Code
mask = review_df_all['review_id'].value_counts()>1
duplicate_idx = review_df_all['review_id'].value_counts()[mask].index
duplicate_idx = duplicate_idx[1:] # drop the most frequent review_id: the 'No_review' placeholder

duplicate_df = review_df_all[review_df_all['review_id'].isin(duplicate_idx)]
print(duplicate_df.groupby(['product_id', 'review_id'])['review_id'].agg(['count']).sort_values(by = 'count', ascending=False).head())

# non-duplicate dataframe
non_duplicate_idx = review_df_all['review_id'].value_counts()[~mask].index
#non_duplcate_df = review_df_all[review_df_all['review_id'].isin(duplicate_idx)]
                      count
product_id review_id       
P377368    782703        59
P384537    730674        14
P139000    292121        13
P122661    619906        13
P384537    730673         9

We want each product ID to correspond to a unique review ID, so let’s remove those duplicates. Our strategy is to keep the first occurrence.

Code
mask = review_df_all["review_id"].isin(duplicate_idx)
review_df_all_deduplicate = review_df_all.loc[mask].drop_duplicates(subset=['product_id', 'review_id'], keep='first')
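drop_duplicates with subset and keep='first' retains exactly one row per (product_id, review_id) pair, keeping the earliest occurrence; a toy check:

```python
import pandas as pd

toy = pd.DataFrame({
    "product_id": ["P1", "P1", "P1", "P2"],
    "review_id":  [10, 10, 11, 10],
    "rating":     [5, 5, 4, 3],
})
# the second (P1, 10) row is dropped; the first is kept
deduped = toy.drop_duplicates(subset=["product_id", "review_id"], keep="first")
print(deduped[["product_id", "review_id"]].values.tolist())
# [['P1', 10], ['P1', 11], ['P2', 10]]
```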

Check again.

Code
print('de-duplicates data frame: ', '\n',
      review_df_all_deduplicate.groupby(['product_id', 'review_id'])['review_id'].agg(['count']).value_counts()
)
de-duplicates data frame:  
 count
1        241827
Name: count, dtype: int64

Combine the processed data frame and the non-duplicate data frame. Note that some indices of the processed data frame already appear in the non-duplicate data frame.

Code
print('number of processed data frame indices that are in the non-duplicate data frame: ', np.isin(review_df_all_deduplicate.index,non_duplicate_idx).sum())
number of processed data frame indices that are in the non-duplicate data frame:  199565
Code
# only need to add processed indices that are not already in the non-duplicate data frame
idx_to_be_added = review_df_all_deduplicate.index[~ np.isin(review_df_all_deduplicate.index,non_duplicate_idx)]
non_duplicate_idx = np.concatenate([non_duplicate_idx, idx_to_be_added])

review_df_all_deduplicate = review_df_all.loc[non_duplicate_idx]

2.3 NA Value

Check for empty text entries, which we labeled as ‘No_review’.

Code
print('number of review entered as \'No_review\': ', (review_df_all_deduplicate['review_text'] == 'No_review').sum())
number of review entered as 'No_review':  1132
Code
review_df_all_deduplicate = review_df_all_deduplicate[review_df_all_deduplicate['review_text'] != 'No_review']

3. EDA

Now we can perform EDA to better understand our data.

3.1 Product Level

Code
product_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8494 entries, 0 to 8493
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   product_id          8494 non-null   object 
 1   product_name        8494 non-null   object 
 2   brand_id            8494 non-null   int64  
 3   brand_name          8494 non-null   object 
 4   loves_count         8494 non-null   int64  
 5   rating              8216 non-null   float64
 6   reviews             8216 non-null   float64
 7   size                6863 non-null   object 
 8   variation_type      7050 non-null   object 
 9   variation_value     6896 non-null   object 
 10  variation_desc      1250 non-null   object 
 11  ingredients         7549 non-null   object 
 12  price_usd           8494 non-null   float64
 13  value_price_usd     451 non-null    float64
 14  sale_price_usd      270 non-null    float64
 15  limited_edition     8494 non-null   int64  
 16  new                 8494 non-null   int64  
 17  online_only         8494 non-null   int64  
 18  out_of_stock        8494 non-null   int64  
 19  sephora_exclusive   8494 non-null   int64  
 20  highlights          6287 non-null   object 
 21  primary_category    8494 non-null   object 
 22  secondary_category  8486 non-null   object 
 23  tertiary_category   7504 non-null   object 
 24  child_count         8494 non-null   int64  
 25  child_max_price     2754 non-null   float64
 26  child_min_price     2754 non-null   float64
dtypes: float64(7), int64(8), object(12)
memory usage: 1.7+ MB

First we want to know:

How many brands and how many products are in this product dataset?

Code
print('Brand number: ', len(product_df['brand_name'].unique()))
print('Product number: ', len(product_df['product_name'].unique()))
Brand number:  304
Product number:  8415

3.1.1 Rating

Now, let’s explore the average rating of the products. Notice that the value of rating is continuously distributed in [0, 5]. We discretize the variable first.

Code
product_df['rating'].head()
0    3.6364
1    4.1538
2    4.2500
3    4.4762
4    3.2308
Name: rating, dtype: float64

Plot distribution.

Code
bins = [0, 1, 2, 3, 4, 5]
labels = ["(0,1]", "(1,2]", "(2,3]", "(3,4]", "(4,5]"]
product_df["rating_interval"] = pd.cut(product_df["rating"], bins=bins, right=True, labels=labels, include_lowest=False)

ax = sns.barplot(product_df["rating_interval"].value_counts().sort_index())
ax.bar_label(ax.containers[0], fontsize = 10)
ax.set_title('Distribution of Product Rating')
plt.grid(True)
plt.show()

Most of the products are rated at or above the average level. Let’s also draw a sunburst chart to explore the relation between rating and the primary, secondary, and tertiary product categories.

Code
df_plot = product_df[~ product_df['tertiary_category'].isna()] # the chart is not robust to na value

print('Number of tertiary_category NA values: ', product_df['tertiary_category'].isna().sum())

fig = px.sunburst(
    df_plot,
    path = ['primary_category', 'secondary_category', 'tertiary_category'],
    color = 'rating',
    maxdepth=-1,
    labels={'rating': 'Product Rating'}
)

fig.update_layout(
    width = 1200,
    height = 900,
    title={
        'text':"Product Category and Rating",
        'x': 0.47,
        'y': 0.95,
        'xanchor': 'center'
    },
    margin=dict(t=70)
)

fig.show()
#fig.write_html("my_sunburst_chart.html", include_plotlyjs=True)
Number of tertiary_category NA values:  990

Many sub-categories under the Fragrance section show a dark color, indicating significantly lower ratings compared to the bright orange/yellow segments. Users are more easily dissatisfied with Sephora’s fragrance-type product offerings.

Skincare (such as “Moisturizers” and “Face Oils”) is dominated by bright yellow and light orange hues, suggesting consistently high customer satisfaction, whereas Makeup, Fragrance, and Hair show darker orange and reddish tones. While those categories carry a massive product volume, they are more likely to receive lower ratings; skincare serves as the reliable, high-quality backbone of the catalog.

After exploring the relation between product rating and category, how about the product price and product categories?

3.1.2 Price

Code
fig = px.sunburst(
    df_plot,
    path = ['primary_category', 'secondary_category', 'tertiary_category'],
    color = 'price_usd',
    maxdepth=-1,
    labels={'price_usd': 'Product Price'}
)

fig.update_layout(
    width = 1200,
    height = 900,
    title={
        'text':"Product Category and Price",
        'x': 0.47,
        'y': 0.95,
        'xanchor': 'center'
    },
    margin=dict(t=70)
)

fig.show()

Notice the very thin but distinct slice labeled “High Tech Tools” (and the nearby “Anti-Aging”), which appears bright yellow. While the majority of the catalog is affordable, this specific sub-category represents the premium price ceiling of Sephora’s inventory, standing out sharply against the rest of the dark purple chart.

The massive makeup section (eye, face, lip, …) is almost entirely deep purple, indicating a consistently low price point compared to other categories. Unlike skincare or fragrance, which show more price variation, makeup has a low cost barrier to entry and is an accessible category.

By contrast, the overall fragrance section is colored in a lighter tone, indicating that the fragrance products generally sit in a “mid-to-high” price tier.

Next, let’s summarize all the numerical variables in the product dataset.

3.1.3 All Numerical Features

Code
df_plot = product_df[~product_df['reviews'].isna()]

fig = px.parallel_coordinates(
    df_plot.iloc[:,[4, 5, 6, 12]], 
    color="price_usd", 
    color_continuous_scale=px.colors.diverging.Spectral, 
    labels={'price_usd': 'Product Price'},
)

fig.update_layout(
    title={
        'text': "Product Price against Product Numeric Features",
        'x': 0.5,
        'y': 0.99,
        'xanchor': 'center'
    },
    margin=dict(t=80)
)

fig.show()

As price peaks at the far right, the blue line dips to near zero on both the reviews and loves_count axes: there is a clear trade-off between price and engagement. Premium products (e.g. the “high tech tools” category) generate almost no community buzz compared to the rest of the categories, likely due to low sales volume.

We see that the red lines (low-price products) dominate the top peaks of loves_count (reaching 1.4M+) and reviews (reaching 20k+). Customers clearly favor cheaper products; virality is exclusive to the lower price tier.

How about the non-numeric features?

3.1.4 highlights

The highlights field of a product is a list of tags or features that highlight the product’s attributes (e.g. [‘Vegan’, ‘Matte Finish’]).

We extract the top 20 most common keywords across all the highlights of products. Then compare their rating, loves_count, and reviews.

This involves creating unigrams, so we will write a function for generating n-grams.

Note: unigram generation requires cleaned text, so we also perform the necessary text cleaning on the highlights column.

Code
# create n gram
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '' and token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
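A quick check of generate_ngrams on a short string. Shown here with a small stand-in stopword set, since the notebook's actual function relies on wordcloud's STOPWORDS:

```python
STOPWORDS = {"the", "and", "for"}  # stand-in for wordcloud.STOPWORDS

def generate_ngrams(text, n_gram=1):
    # lowercase, split on spaces, drop empty tokens and stopwords
    tokens = [t for t in text.lower().split(' ') if t != '' and t not in STOPWORDS]
    # slide a window of length n_gram over the token list
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [' '.join(ng) for ng in ngrams]

print(generate_ngrams("Good for the Planet", n_gram=1))  # ['good', 'planet']
print(generate_ngrams("Good for the Planet", n_gram=2))  # ['good planet']
```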
Code
df_highlights = product_df[~ product_df['highlights'].isna()].copy()

df_highlights['highlights_list'] = df_highlights['highlights'].apply(ast.literal_eval)
df_highlights['highlights_list'] = df_highlights['highlights_list'].apply(lambda x: ' '.join(x))

df_highlights['highlights_list'] = df_highlights['highlights_list'].str.replace(r'(?<!\w)-(?!\w)', ' ', regex=True)
df_highlights['highlights_list'] = df_highlights['highlights_list'].str.replace(r'[^\w\s-]', ' ', regex=True)
df_highlights['highlights_list'] = df_highlights['highlights_list'].str.strip()
df_highlights['highlights_list'] = df_highlights['highlights_list'].str.split()
df_highlights['highlights_list'] = df_highlights['highlights_list'].apply(lambda x: ' '.join(x))

df_highlights['unigram'] = df_highlights['highlights_list'].apply(lambda x: generate_ngrams(x, n_gram=1))

unigrams = defaultdict(int)

for row in df_highlights['unigram']:
    for word in row:
        unigrams[word] += 1

df_unigram = pd.DataFrame(sorted(unigrams.items(), key=lambda x: x[1])[::-1])

Plot the corresponding boxplots.

Note: for reviews and loves_count, we apply \(x \rightarrow log(1+x)\) transformation, otherwise the plots will be too skewed.
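A quick check of what the transformation does to a heavily skewed range — values spanning four orders of magnitude collapse to single digits:

```python
import numpy as np

vals = np.array([0, 9, 99, 9999])
# log(1 + x) compresses the spread while keeping 0 at 0
print(np.log1p(vals))  # → [0, ~2.3, ~4.6, ~9.2]
```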

Code
rows = list(df_unigram.loc[df_unigram.iloc[:, 1] >= 900, 0]) # Top 20
col_names = ["rating", "reviews", "loves_count"]

for col in col_names:
    plot_data = []
    for word in rows:
        sub = df_highlights[df_highlights["highlights_list"]
                            .str.contains(word, case=False, regex=False)][col]
        vals = np.log1p(sub) if col != "rating" else sub
        plot_data.append(pd.DataFrame({
            "word": word,
            col: vals
        }))
    plot_df = pd.concat(plot_data, ignore_index=True)

    plt.figure(figsize=(12, 6))
    sns.boxplot(data=plot_df, x="word", y=col, color="skyblue")
    plt.xticks(rotation=45, ha="right")
    plt.xlabel("keyword")
    plt.ylabel(col if col == "rating" else f"log (1 + {col})")
    plt.title(f"{col} by Top-20 highlights keyword")
    plt.tight_layout()
    plt.show()

Based on the three plots above, there is not much variation across the keywords; a highlight keyword alone is not the factor driving these numerical metrics.

3.2 Review Level

First, how many customers posted at least 1 review?

Code
print('Unique customers: ', len(review_df_all_deduplicate['author_id'].unique()))
Unique customers:  432578

Then, what does the time interval between posted reviews look like?

3.2.1 Time Range

Code
review_df_all_deduplicate['submission_year'] = pd.to_datetime(review_df_all_deduplicate['submission_time']).dt.year
review_df_all_deduplicate['submission_month'] = pd.to_datetime(review_df_all_deduplicate['submission_time']).dt.month

plot_data = review_df_all_deduplicate.loc[:, ['submission_year', 'submission_month']]
plot_data = plot_data.groupby(["submission_year", "submission_month"]).size().reset_index(name="count")

fig = px.sunburst(
    plot_data,
    path=["submission_year", "submission_month"],
    values="count",
    color="count",
    color_continuous_scale="YlOrRd",
    labels={"count": "Reviews Count by Month"},
)

fig.update_layout(
    width = 1200,
    height = 900,
    title={
        'text':"Reviews Count",
        'x': 0.47,
        'y': 0.95,
        'xanchor': 'center'
    },
    margin=dict(t=70)
)

fig.show()

Note: ignore the color differences in the inner (year) ring of the sunburst.

The reviews span from 2008 to 2023. Among these years, 2020, 2021, and 2022 account for a large proportion of the reviews. In addition, reviews peak in January, April, May, and August during these years.

Since we are dealing with unstructured text data, N-gram analysis is essential to capture recurring phrase patterns and understand the linguistic structure of the reviews.

3.2.2 N-Gram Analysis

Create a dataframe containing the n-grams we need; we will explore unigrams, bigrams, and trigrams.

Code
df_gram = review_df_all_deduplicate.copy()
df_gram = df_gram[['author_id','review_id','review_text']]

# unigram
df_gram['unigram'] = review_df_all_deduplicate['review_text'].apply(lambda x: generate_ngrams(x, n_gram=1))
# bigram
df_gram['bigram'] = review_df_all_deduplicate['review_text'].apply(lambda x: generate_ngrams(x, n_gram=2))
# trigram
df_gram['trigram'] = review_df_all_deduplicate['review_text'].apply(lambda x: generate_ngrams(x, n_gram=3))

Unigram

The most common unigrams are mostly stop words and uncleaned tokens (with punctuation attached), which carry little information; this indicates we need further cleaning.

Code
unigrams = defaultdict(int)

for row in df_gram['unigram']:
    for word in row:
        unigrams[word] += 1

df_unigram = pd.DataFrame(sorted(unigrams.items(), key=lambda x: x[1])[::-1])

fig, ax = plt.subplots(figsize=(18, 50), dpi=100)
N = 100
sns.barplot(y = df_unigram[0][:N], x=df_unigram[1][:N])

ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top 100 most common unigrams in customer reviews')
plt.show()
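As an aside, the `defaultdict` tally above is equivalent to `collections.Counter` over a flattened iterator — a one-liner for the same counting pattern that recurs below for bigrams and trigrams:

```python
from collections import Counter
from itertools import chain

# toy stand-in for df_gram['unigram']: one token list per review
toy_rows = [['dry', 'skin'], ['dry', 'skin', 'cream'], ['cream']]
counts = Counter(chain.from_iterable(toy_rows))
print(counts.most_common(3))  # → [('dry', 2), ('skin', 2), ('cream', 2)]
```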

Bigram

Bigrams reveal more information than unigrams. The most common bigrams are about skin and surface detailed information about customers’ skin: for instance, ‘dry skin’ and ‘sensitive skin’ are mentioned very frequently. We can also infer that customers who bought skincare products are likely to leave a comment.

But most bigrams still contain stop words and uncleaned tokens (with punctuation attached), so we need further cleaning.

Code
bigrams = defaultdict(int)

for row in df_gram['bigram']:
    for word in row:
        bigrams[word] += 1

df_bigram = pd.DataFrame(sorted(bigrams.items(), key=lambda x: x[1])[::-1])

fig, ax = plt.subplots(figsize=(18, 50), dpi=100)

sns.barplot(y = df_bigram[0][:N], x=df_bigram[1][:N])

ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top 100 most common bigrams in customer reviews')
plt.show()

Trigram

The most common trigrams are also about skin; the top trigram shows that customers care about the long-lasting effects of the products.

The same stop-word and uncleaned-token issues appear in this plot as well; further cleaning is needed.

Code
trigrams = defaultdict(int)

for row in df_gram['trigram']:
    for word in row:
        trigrams[word] += 1

df_trigram = pd.DataFrame(sorted(trigrams.items(), key=lambda x: x[1])[::-1])

fig, ax = plt.subplots(figsize=(18, 50), dpi=100)

sns.barplot(y = df_trigram[0][:N], x=df_trigram[1][:N])

ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top 100 most common trigrams in customer reviews')
plt.show()

This is enough for EDA. As the N-gram analysis above indicated, we now proceed to clean the texts before word embedding.

4. Text Cleaning

4.1 Language Detection

This dataset contains product reviews in multiple languages — for example Spanish, French, and Chinese — so our first task is to keep English-only reviews.

We will use two language detection tools: langdetect and langid. The first performs the detection; the second validates the detection results. Detailed scripts are omitted for brevity; interested readers can find them here.

Here we directly import the processed dataset.

Code
review_df_all_deduplicate = pd.read_csv("dataset/review_df_all_deduplicate_english.csv", index_col = 0)

4.2 Lowercase Letter

Code
review_df_all_deduplicate['review_text'] = review_df_all_deduplicate['review_text'].str.lower()

4.3 Standardize Unicode Character

After removing the non-English reviews, there remain other Unicode variants of standard punctuation marks, e.g. a fullwidth ‘！’ \(\rightarrow\) ‘!’; we clean them by mapping them to standard English punctuation characters.

We first use the unicodedata library to systematically normalize the variants, then use an LLM to detect remaining variants and list the corresponding character mapping; the details of the LLM step are omitted.
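For instance, NFKC normalization folds fullwidth and ligature forms into their ASCII equivalents:

```python
import unicodedata

# fullwidth '！' and '？' and the 'ﬁ' ligature all normalize to plain ASCII
print(unicodedata.normalize('NFKC', '！？ﬁ'))  # → '!?fi'
```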

Code
def clean_unicode(text):
    text = unicodedata.normalize('NFKC', text)

    return text

def detect_anomalies(text):
    anomalies = []
    for char in text:
        cp = ord(char)
        # code points 0-127 are plain ASCII; skip them
        if cp < 128:
            continue
            
        category = unicodedata.category(char)
        # Unicode category Po: Punctuation, other
        if category in ['Po']:
            anomalies.append(f"{char}") # ({hex(cp)})
            
    return ",".join(anomalies) if anomalies else None

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text'].map(clean_unicode)
Code
a = review_df_all_deduplicate['review_text_cleaned'].apply(detect_anomalies)
# export for LLM to detect
a[a.notna()].to_csv('dataset/special_punctuation.csv', encoding="utf-8-sig", index=True)
Code
def clean_unicode(text):
    punctuation_map = {
        '。': '.',        
        '、': ',',   
        '·': '.',         
        '・': '.',       
        '¡': '!',      
        '¿': '?',        
        '،': ',',  
        }
    
    unmappable_symbols = ['•', '⁃']

    for old, new in punctuation_map.items():
        text = re.sub(old, new, text)

    for char in unmappable_symbols:
        text = re.sub(char, ' ', text)

    return text

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].map(clean_unicode)

4.4 Non-ASCII Character

This is similar to the Unicode cleaning; here we remove the meaningless non-ASCII characters.

Non-ASCII characters include non-English letters such as ‘å’, emojis such as ‘🙂’, and characters that cannot be displayed. However, we will retain only the emojis, since the embedding model used later can recognize emojis and convert them to meaningful numerical vectors.

Code
# English letters + number + ASCII punctuation
ascii_allowed = r"[A-Za-z0-9\s\.,!?;:'\"()\-\[\]{}]"

emoji_pattern = (
    r"[\U0001F1E0-\U0001F1FF"  # Flags
    r"\U0001F300-\U0001F5FF"   # Symbols & pictographs
    r"\U0001F600-\U0001F64F"   # Emoticons
    r"\U0001F680-\U0001F6FF"   # Transport & map symbols
    r"\U0001F700-\U0001F77F"   # Alchemical symbols
    r"\U0001F780-\U0001F7FF"   # Geometric symbols
    r"\U0001F800-\U0001F8FF"   # Supplemental arrows
    r"\U0001F900-\U0001F9FF"   # Supplemental symbols & pictographs
    r"\U0001FA00-\U0001FA6F"   # Chess, symbols
    r"\U0001FA70-\U0001FAFF"   # Emoji components
    r"\U00002702-\U000027B0"   # Dingbats
    r"\U000024C2-\U0001F251"   # Enclosed characters
    r"]"
)

allowed = f"(?:{ascii_allowed}|{emoji_pattern})"

def clean_text_keep_emoji(text):
    return "".join(char if re.match(allowed, char) else "" for char in text)

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].apply(lambda x: clean_text_keep_emoji(x))
Code
punc_lis_remain = []

for text in review_df_all_deduplicate['review_text_cleaned']:
    punc_lis_remain.extend(re.findall(r"[^\w\s]", text))

punc_lis_remain = sorted(set(punc_lis_remain), key=lambda ch: ord(ch))
punc_lis_remain = [p for p in punc_lis_remain if not p.isascii()]

print('Non-ASCII characters remained: ', punc_lis_remain)
Non-ASCII characters remained:  ['─', '╥', '■', '▪', '▫', '▰', '▶', '◇', '◡', '◼', '☀', '☁', '★', '☆', '☝', '☬', '☹', '☺', '♀', '♂', '♡', '♥', '♻', '♾', '⚗', '⚜', '⚠', '⚡', '⚪', '⛅', '⛑', '✅', '✈', '✋', '✌', '✓', '✔', '✖', '✦', '✨', '✳', '❄', '❌', '❕', '❗', '❣', '❤', '➕', '➖', '➡', '➾', '➿', '⠀', '⬇', '⭐', '〈', '《', '》', '〜', '〰', '\uf04c', '\uf0a7', '︅', '︎', '️', '\ufeff', '', '�', '🅶', '🅷', '🅼', '🅾', '🆂', '🆃', '🆄', '🆈', '🆓', '🆙', '🌊', '🌙', '🌞', '🌟', '🌫', '🌱', '🌸', '🌹', '🌺', '🌿', '🍂', '🍃', '🍅', '🍉', '🍊', '🍋', '🍌', '🍒', '🍓', '🍞', '🍬', '🍯', '🎀', '🎄', '🎉', '🏆', '🏻', '🏼', '🏽', '🏾', '🐐', '🐘', '🐝', '🐣', '👀', '👁', '👄', '👋', '👌', '👍', '👎', '👏', '👣', '👩', '👶', '💀', '💄', '💅', '💋', '💓', '💔', '💕', '💖', '💗', '💘', '💙', '💚', '💛', '💜', '💞', '💡', '💥', '💦', '💨', '💫', '💯', '📌', '📍', '📦', '🔥', '🔵', '🖤', '😀', '😁', '😂', '😃', '😄', '😅', '😆', '😉', '😊', '😌', '😍', '😏', '😒', '😓', '😔', '😕', '😖', '😘', '😚', '😝', '😞', '😠', '😢', '😩', '😪', '😫', '😬', '😭', '😮', '😲', '😳', '😶', '😻', '🙁', '🙂', '🙃', '🙄', '🙌', '🙏', '🛍', '🛑', '🛒', '🤌', '🤍', '🤎', '🤐', '🤓', '🤔', '🤗', '🤞', '🤡', '🤢', '🤣', '🤤', '🤦', '🤨', '🤩', '🤪', '🤭', '🤮', '🤯', '🤷', '🥑', '🥰', '🥲', '🥳', '🥴', '🥶', '🥹', '🥺', '🦝', '🦲', '🧖', '🧡', '🧴', '🧼', '🧿', '🩸', '🪄', '🫠', '🫣', '🫰', '🫶']
Code
non_emoji_list = ['─', '╥', '■', '▪', '▫', '▰', '▶', '◇', '◡', '◼', 
                  '〈', '《', '》', '〜', '〰', 
                  '✦', '✳', '⠀', '︅', '︎', '️', '\ufeff', '', '\uf04c', '\uf0a7']

pat = "[" + re.escape("".join(non_emoji_list)) + "]"
review_df_all_deduplicate["review_text_cleaned"] = review_df_all_deduplicate["review_text_cleaned"].str.replace(pat, "", regex=True)

punc_lis_remain = []
for text in review_df_all_deduplicate['review_text_cleaned']:
    punc_lis_remain.extend(re.findall(r"[^\w\s]", text))

punc_lis_remain = sorted(set(punc_lis_remain), key=lambda ch: ord(ch))
punc_lis_remain = [p for p in punc_lis_remain if not p.isascii()]

print('Non-ASCII characters remained after cleaning: ', punc_lis_remain)
Non-ASCII characters remained after cleaning:  ['☀', '☁', '★', '☆', '☝', '☬', '☹', '☺', '♀', '♂', '♡', '♥', '♻', '♾', '⚗', '⚜', '⚠', '⚡', '⚪', '⛅', '⛑', '✅', '✈', '✋', '✌', '✓', '✔', '✖', '✨', '❄', '❌', '❕', '❗', '❣', '❤', '➕', '➖', '➡', '➾', '➿', '⬇', '⭐', '�', '🅶', '🅷', '🅼', '🅾', '🆂', '🆃', '🆄', '🆈', '🆓', '🆙', '🌊', '🌙', '🌞', '🌟', '🌫', '🌱', '🌸', '🌹', '🌺', '🌿', '🍂', '🍃', '🍅', '🍉', '🍊', '🍋', '🍌', '🍒', '🍓', '🍞', '🍬', '🍯', '🎀', '🎄', '🎉', '🏆', '🏻', '🏼', '🏽', '🏾', '🐐', '🐘', '🐝', '🐣', '👀', '👁', '👄', '👋', '👌', '👍', '👎', '👏', '👣', '👩', '👶', '💀', '💄', '💅', '💋', '💓', '💔', '💕', '💖', '💗', '💘', '💙', '💚', '💛', '💜', '💞', '💡', '💥', '💦', '💨', '💫', '💯', '📌', '📍', '📦', '🔥', '🔵', '🖤', '😀', '😁', '😂', '😃', '😄', '😅', '😆', '😉', '😊', '😌', '😍', '😏', '😒', '😓', '😔', '😕', '😖', '😘', '😚', '😝', '😞', '😠', '😢', '😩', '😪', '😫', '😬', '😭', '😮', '😲', '😳', '😶', '😻', '🙁', '🙂', '🙃', '🙄', '🙌', '🙏', '🛍', '🛑', '🛒', '🤌', '🤍', '🤎', '🤐', '🤓', '🤔', '🤗', '🤞', '🤡', '🤢', '🤣', '🤤', '🤦', '🤨', '🤩', '🤪', '🤭', '🤮', '🤯', '🤷', '🥑', '🥰', '🥲', '🥳', '🥴', '🥶', '🥹', '🥺', '🦝', '🦲', '🧖', '🧡', '🧴', '🧼', '🧿', '🩸', '🪄', '🫠', '🫣', '🫰', '🫶']

4.5 Abbreviation

Two types of abbreviation:

  1. Contraction e.g.: i’ve and don’t, …

  2. Abbreviation:

    • slang/Internet slang, e.g.: idk, lol, omg, …

    • specific terms, e.g: aha, bha, …

For type 1:

We detect none — no tokens contain an apostrophe.

Code
# variant of the earlier n-gram generator, with the condition relaxed: stopwords are kept
def generate_ngrams_1(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '']
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
Code
df_gram_abbreviation = pd.DataFrame(review_df_all_deduplicate['review_text_cleaned'])
df_gram_abbreviation['unigram'] = df_gram_abbreviation['review_text_cleaned'].apply(lambda x: generate_ngrams_1(x, n_gram=1))

all_tokens = (tok for lst in df_gram_abbreviation["unigram"] for tok in lst)

pattern = re.compile(r"[']")
abbr_tokens = [t for t in all_tokens if pattern.search(t)]

print('Number of tokens that contain \': ', len(abbr_tokens))
Number of tokens that contain ':  0

For type 2:

Code
df_gram_abbreviation = pd.DataFrame(review_df_all_deduplicate['review_text_cleaned'])
df_gram_abbreviation['unigram'] = df_gram_abbreviation['review_text_cleaned'].apply(lambda x: generate_ngrams(x, n_gram=1))

all_tokens = (tok for lst in df_gram_abbreviation["unigram"] for tok in lst)
short_tokens = [t for t in all_tokens if len(t) < 4]
short_freq = Counter(short_tokens)

print('Top 300 frequent abbreviation: ')
# top 300
short_freq.most_common(300) 
Top 300 frequent abbreviation: 
[('use', 272304),
 ('ive', 171096),
 ('dry', 159353),
 ('im', 154930),
 ('one', 127205),
 ('it.', 124566),
 ('eye', 78190),
 ('see', 77121),
 ('try', 74632),
 ('day', 67153),
 ('got', 64184),
 ('oil', 59045),
 ('now', 57219),
 ('bit', 56869),
 ('put', 49486),
 ('way', 48134),
 ('say', 44528),
 ('me.', 42047),
 ('-', 41424),
 ('it!', 41128),
 ('lot', 40673),
 ('new', 40599),
 ('lip', 40348),
 ('go', 40282),
 ('buy', 39846),
 ('it,', 34403),
 ('two', 32847),
 ('2', 31272),
 ('3', 25486),
 ('far', 24733),
 ('bad', 22231),
 ('ill', 21496),
 ('id', 19492),
 ('job', 18429),
 ('gel', 18352),
 ('big', 17864),
 ('top', 17829),
 ('.', 17445),
 ('saw', 17114),
 ('may', 16945),
 ('red', 16641),
 ('spf', 16066),
 ('c', 15669),
 ('5', 15594),
 ('fan', 15298),
 ('yet', 15091),
 ('4', 14276),
 ('up.', 13624),
 ('let', 13066),
 ('end', 12823),
 ('add', 12805),
 ('due', 12034),
 ('me,', 11942),
 ('on.', 11419),
 ('!', 11052),
 ('rid', 10846),
 ('sun', 10810),
 ('jar', 10609),
 ('(i', 10440),
 (',', 9992),
 ('run', 9788),
 ('old', 9731),
 ('10', 9385),
 ('rub', 8722),
 ('bed', 8609),
 ('1', 8498),
 ('mix', 7952),
 (':)', 7738),
 ('100', 7611),
 ('ago', 7481),
 ('set', 6567),
 ('is.', 6520),
 ('box', 6383),
 ('on,', 6206),
 ('do.', 6172),
 ('me!', 6158),
 ('in.', 5749),
 ('non', 5487),
 ('6', 5468),
 ('up,', 5288),
 ('30', 5086),
 ('tan', 5078),
 ('dr.', 4841),
 ('de', 4401),
 ('bc', 4292),
 ('ok', 4279),
 ('2-3', 4275),
 ('la', 4225),
 ('to.', 4189),
 ('pay', 4146),
 ('per', 4145),
 ('hot', 4060),
 ('wow', 4003),
 ('sit', 3997),
 ('20', 3982),
 ('so,', 3861),
 ('def', 3856),
 ('7', 3826),
 ('ran', 3777),
 ('oh', 3601),
 ('tea', 3588),
 ('mom', 3573),
 ('t', 3548),
 ('yes', 3426),
 (':(', 3407),
 ('15', 3395),
 ('wet', 3222),
 ('is,', 3218),
 ('ton', 3134),
 ('pad', 3058),
 ('pat', 2932),
 ('age', 2922),
 ('kit', 2788),
 ('8', 2671),
 ('up!', 2638),
 ('in,', 2570),
 ('tag', 2568),
 ('50', 2516),
 ('omg', 2448),
 ('(', 2443),
 ('sad', 2431),
 ('cut', 2425),
 ('!!', 2424),
 ('aha', 2407),
 ('mer', 2406),
 ('--', 2378),
 ('fun', 2378),
 ('ole', 2373),
 ('dab', 2358),
 ('bag', 2350),
 ('pm', 2343),
 ('40', 2328),
 ('12', 2311),
 ('u', 2288),
 ('lol', 2281),
 ('fix', 2269),
 ('low', 2267),
 ('bar', 2263),
 ('pea', 2164),
 ('go.', 2161),
 ('ptr', 2055),
 ('3-4', 2043),
 ('us', 2034),
 ('pop', 2019),
 ('ok.', 2018),
 ('air', 1980),
 ('1-2', 1960),
 ('be.', 1936),
 ('hit', 1925),
 ('...', 1872),
 ('bha', 1846),
 ('spa', 1838),
 ('do,', 1836),
 ('cc', 1804),
 ('so.', 1780),
 ('fit', 1780),
 ('soo', 1775),
 ('dr', 1771),
 ('w', 1755),
 ('fab', 1731),
 ('lid', 1715),
 ('aid', 1698),
 ('oz', 1688),
 ('ask', 1688),
 ('odd', 1646),
 ('tip', 1644),
 ('tub', 1605),
 ('tad', 1584),
 ('bb', 1567),
 ('(im', 1565),
 ('3rd', 1534),
 ('cap', 1533),
 ('con', 1493),
 ('is!', 1491),
 ('jet', 1489),
 ('sat', 1470),
 ('ha', 1452),
 ('(it', 1451),
 ('2nd', 1444),
 ('on!', 1439),
 ('2x', 1434),
 ('30s', 1428),
 ('key', 1427),
 ('mid', 1424),
 ('ok,', 1392),
 ('boy', 1387),
 ('of.', 1372),
 ('soy', 1362),
 ('zit', 1353),
 ('ren', 1352),
 ('(or', 1313),
 ('god', 1303),
 ('45', 1294),
 ('via', 1289),
 ('win', 1287),
 ('lag', 1286),
 ('(my', 1271),
 ('!!!', 1267),
 ('..', 1237),
 ('(as', 1197),
 ('vit', 1195),
 (':', 1190),
 ('24', 1190),
 ('to,', 1176),
 ('tin', 1173),
 ('ur', 1167),
 ('dot', 1158),
 ('(a', 1154),
 ('ten', 1131),
 ('no.', 1129),
 ('60', 1115),
 ('20s', 1103),
 ('pot', 1093),
 ('bye', 1089),
 ('(in', 1082),
 ('idk', 1067),
 ('25', 1065),
 ('n', 1059),
 ('1st', 1058),
 ('fav', 1047),
 ('e', 1045),
 ('min', 1029),
 ('14', 997),
 ('(if', 985),
 ('tho', 983),
 ('(no', 982),
 ('six', 978),
 ('hg', 974),
 ('row', 965),
 ('jaw', 961),
 ('dew', 952),
 ('❤', 949),
 ('c.', 937),
 ('c,', 917),
 ('etc', 911),
 ('die', 894),
 ('bay', 874),
 ('vs', 867),
 ('vib', 864),
 ('do!', 855),
 (')', 854),
 ('4-5', 846),
 ('am.', 845),
 ('duo', 831),
 ('to!', 819),
 ('eat', 803),
 ('2.', 797),
 ('90', 794),
 ('dip', 791),
 ('pro', 781),
 ('man', 780),
 ('re', 774),
 ('9', 752),
 ('jlo', 748),
 ('1.', 745),
 ('lil', 743),
 ('35', 735),
 ('ups', 720),
 ('34', 701),
 ('pm.', 697),
 ('vox', 695),
 ('80', 692),
 ('go!', 692),
 ('son', 691),
 ('st.', 682),
 ('s', 680),
 ('4th', 677),
 ('ph', 676),
 ('tlc', 675),
 ('13', 664),
 ('tap', 644),
 ('yo', 637),
 ('uv', 634),
 ('itd', 629),
 ('cuz', 628),
 ('(at', 623),
 ('28', 618),
 (';)', 617),
 ('40s', 611),
 ('mad', 609),
 ('raw', 609),
 ('it?', 609),
 ('18', 604),
 ('o', 602),
 ('3x', 601),
 ('it)', 593),
 ('23', 578),
 ('me)', 576),
 ('(so', 576),
 ('guy', 563),
 ('no,', 562),
 ('pre', 557),
 ('ceo', 557),
 ('0', 556),
 ('gym', 553),
 ('go,', 550),
 ('oat', 550),
 ('55', 549),
 ('ice', 517),
 ('met', 509)]

We standardized abbreviations by examining the top 300 most frequent tokens with a length of less than 4 characters. While these thresholds (Top-300, Length < 4) are heuristic, this approach effectively captures the majority of high-frequency cases without requiring exhaustive manual review.

Create correction mapping.

Code
correct_map = {
    r"\bive\b": "i have",
    r"\bim\b": "i am",
    r"\bill\b": "i will",
    r"\bspf\b": "sun protection factor",
    r"\bbc\b": "because",
    r"\bdr\.\b": "doctor",
    r"\bdr\b": "doctor",
    r"\bdef\b": "definitely",
    r"\bptr\b": "peter thomas roth",
    r"\bbha\b": "beta hydroxy acid",
    r"\baha\b": "alpha hydroxy acid",
    r"\bbb\b": "beauty balm",
    r"\bha\b": "hyaluronic acid",
    r"\bomg\b": "oh my god",
    r"\blol\b": "laugh out loud",
    r"\bidk\b": "i do not know",
    r"\bur\b": "your",
    r"\btho\b": "though",
    r"\bfav\b": "favorite",
    r"\bceo\b": "sunday riley",
    r"\bhg\b": "holy grail",
    r"\betc\b": "et cetera",
    r"\bvib\b": "very important beauty insider",
    r"\bmin\b": "minutes",
    r"\bph\b": "potential of hydrogen",
    r"\bmeh\b": "eh",
    r"\bfab\b": "first aid beauty",
    r"\buv\b": "ultraviolet",
    r"\bitd\b": "it would",
    r"\bcuz\b": "because",
    r"\bvit\b": "vitamin",
    r"\b30s\b": "30 seconds",
    r"\b20s\b": "20 seconds",
    r"\b40s\b": "40 seconds",
    r"\b1st\b": "first",
    r"\b2nd\b": "second",
    r"\b3rd\b": "third",
    r"\b4th\b": "fourth",
    r"\b3x\b": "3 times",
    r"\blil\b": "little",
    
    # latter added
    r"\bwouldnt\b": "would not",
    r"\bhavec\b": "have",
    r"\bbomb.com\b": "great",
    r"\byrs\b": "years",
    r"\bfyi\b": "for your information",
    r"\brn\b": "right now",
    r"\bsoo\b": "so",
    r"\bgonna\b": "going to",
    r"\byoull\b": "you will",

    r"(?<=\s)\(im(?=\s)": "i am",
    r"(?<=\s)\(if(?=\s)": "if",
    r"(?<=\s)\(no(?=\s)": "no",
    r"(?:(?<=\s)|^)\bit\)(?=\s|$)": "it",
    r"(?:(?<=\s)|^)\bme\)(?=\s|$)": "me",
    r"(?<=\s)\(so(?=\s)": "so",
}

review_df_all_deduplicate['review_text_cleaned'] = (
    review_df_all_deduplicate['review_text_cleaned']
    .replace(correct_map,regex=True)
)
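As a sanity check on the word-boundary patterns (a toy subset of the map above, applied with plain `re.sub`, which is what the pandas `replace(..., regex=True)` call does per pattern):

```python
import re

sample_map = {
    r"\bive\b": "i have",
    r"\bim\b": "i am",
    r"\bbc\b": "because",
}

sample = "ive repurchased bc im obsessed"
for pat, rep in sample_map.items():
    sample = re.sub(pat, rep, sample)
print(sample)  # → 'i have repurchased because i am obsessed'
```

The `\b` anchors ensure only whole tokens are expanded, so e.g. the ‘im’ inside other words is left alone.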

Specifically fix the ‘ii’ abbreviation.

‘ii’ could be a typo for ‘i’, or it could be part of the product name ‘sk-ii’.

Code
COMMON_VERBS_AFTER_I = {
    "love","like","think","use","used","feel","felt","have","had","am",
    "would","will","wish","want","tried","try","bought","buy",
    "recommend","ordered","order","see","saw","noticed","notice",
    "find","found","hate","dislike","prefer","need","needed",

    # latter added
    "mean", "only", "also", "heard", 
}

def fix_ii_typos(text):

    tokens = text.split()
    new_tokens = []

    for idx, tok in enumerate(tokens):
        lower_tok = tok.lower()

        if lower_tok == "ii":
            prev_tok = tokens[idx-1].lower() if idx > 0 else ""
            next_tok = tokens[idx+1].lower() if idx+1 < len(tokens) else ""

            if prev_tok.isdigit():
                new_tokens.append(tok)

            elif next_tok in COMMON_VERBS_AFTER_I:
                new_tokens.append("i")
            else:
                new_tokens.append(tok)
        else:
            new_tokens.append(tok)

    return " ".join(new_tokens)

mask = review_df_all_deduplicate['review_text_cleaned'].str.contains(r'\bii\b', regex=True)
review_df_all_deduplicate.loc[mask, 'review_text_cleaned'] = review_df_all_deduplicate.loc[mask, 'review_text_cleaned'].apply(fix_ii_typos)

4.6 URL

Clean URLs in reviews.

Code
def clean_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+|\bhttps?\b') # enhanced, remove residual 'http'
    
    return url.sub('',text).strip()

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].map(clean_URL)
review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].str.replace(r"\b[\w.-]+\.com\b", "", regex=True)
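A quick check of the URL pattern (the same regex as above) on a toy review:

```python
import re

url_pat = re.compile(r'https?://\S+|www\.\S+|\bhttps?\b')
print(url_pat.sub('', "love it! see www.sephora.com").strip())  # → 'love it! see'
```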

4.7 Letter Repeat

Clean words that contain multiple repeated letters (e.g., loooooooove, soooooooo …):

  1. Reduce any run of the same letter of length 3 or more down to 2 characters to restore standard spelling.

  2. Find and analyze the top 500 most frequent words containing repeated characters to identify and handle remaining irregularities.
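Step 1 can be sketched in isolation: the back-reference matches any letter followed by two or more copies of itself (three or more in total) and collapses the run to two.

```python
import re

# ([A-Za-z])\1{2,} = a letter plus 2+ repeats of itself; replace the run with 2 copies
print(re.sub(r"([A-Za-z])\1{2,}", r"\1\1", "loooooove it soooo much"))  # → 'loove it soo much'
```

Collapsing to two (rather than one) avoids mangling legitimate double letters like ‘good’; residues such as ‘loove’ are then handled by the correction map in step 2.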

Code
mask = review_df_all_deduplicate['review_text_cleaned'].str.contains(r"([A-Za-z])\1{2,}", regex=True)


review_df_all_deduplicate.loc[mask, 'review_text_cleaned'] = (
    review_df_all_deduplicate.loc[mask, 'review_text_cleaned']
    .str.replace(r"([A-Za-z])\1{2,}", r"\1\1", regex=True)
)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_27984\277250214.py:1: UserWarning:

This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.
Code
df_gram_repeat = review_df_all_deduplicate.loc[mask, 'review_text_cleaned'].copy()
df_gram_repeat = pd.DataFrame(df_gram_repeat)
df_gram_repeat['unigram'] = df_gram_repeat['review_text_cleaned'].apply(lambda x: generate_ngrams(x, n_gram=1))
Code
all_tokens = (tok for lst in df_gram_repeat["unigram"] for tok in lst)

short_tokens = [t for t in all_tokens if re.search(r"([A-Za-z])\1", t)] 
short_freq = Counter(short_tokens)

print('Top 500 frequent words with repeated letter: ')

# top 500
short_freq.most_common(500)
Top 500 frequent words with repeated letter: 
[('soo', 7964),
 ('really', 5194),
 ('will', 4676),
 ('feel', 3779),
 ('little', 3375),
 ('feels', 3206),
 ('good', 2891),
 ('smells', 2337),
 ('recommend', 1877),
 ('look', 1801),
 ('smell', 1734),
 ('feeling', 1667),
 ('see', 1619),
 ('well', 1548),
 ('smooth', 1532),
 ('still', 1418),
 ('need', 1384),
 ('stuff', 1285),
 ('looks', 1284),
 ('looking', 1239),
 ('actually', 1159),
 ('better', 1113),
 ('week', 1050),
 ('lovee', 1028),
 ('apply', 1025),
 ('full', 1019),
 ('pretty', 1004),
 ('difference', 996),
 ('bottle', 951),
 ('free', 930),
 ('loove', 914),
 ('weeks', 813),
 ('literally', 812),
 ('less', 796),
 ('happy', 768),
 ('keep', 755),
 ('usually', 743),
 ('looked', 736),
 ('getting', 697),
 ('small', 690),
 ('redness', 662),
 ('sunscreen', 644),
 ('especially', 601),
 ('all.', 599),
 ('good.', 586),
 ('different', 559),
 ('well.', 528),
 ('tell', 519),
 ('applying', 495),
 ('good!', 481),
 ('took', 461),
 ('finally', 446),
 ('totally', 420),
 ('applied', 402),
 ('loong', 398),
 ('smooth.', 394),
 ('overall', 388),
 ('wayy', 359),
 ('better.', 355),
 ('immediately', 352),
 ('keeps', 342),
 ('putting', 341),
 ('seen', 337),
 ('too.', 320),
 ('off.', 319),
 ('seems', 316),
 ('irritate', 316),
 ('recommended', 298),
 ('summer', 298),
 ('add', 292),
 ('waay', 277),
 ('needed', 267),
 ('recommend!', 260),
 ('all,', 257),
 ('cheeks', 251),
 ('three', 250),
 ('smooth,', 243),
 ('soon', 243),
 ('deep', 242),
 ('smaller', 241),
 ('application', 240),
 ('good,', 238),
 ('well,', 233),
 ('gotten', 231),
 ('smell.', 228),
 ('recommend.', 225),
 ('off,', 224),
 ('stuff.', 218),
 ('added', 218),
 ('normally', 216),
 ('difference.', 216),
 ('obsessed', 216),
 ('issues', 214),
 ('bottle.', 212),
 ('peel', 209),
 ('sleeping', 209),
 ('smoother', 207),
 ('cooling', 205),
 ('suuper', 197),
 ('stopped', 197),
 ('gross', 196),
 ('effective', 193),
 ('matte', 189),
 ('irritated', 188),
 ('smell,', 188),
 ('seeing', 187),
 ('too!', 184),
 ('irritation', 183),
 ('impressed', 182),
 ('seem', 178),
 ('personally', 174),
 ('sleep', 171),
 ('effect', 166),
 ('follow', 163),
 ('stuff!', 161),
 ('green', 161),
 ('cool', 160),
 ('week.', 159),
 ('disappointed', 158),
 ('matter', 153),
 ('glass', 153),
 ('waterproof', 151),
 ('essence', 151),
 ('dryness', 150),
 ('fell', 149),
 ('soothing', 147),
 ('bigger', 146),
 ('clogged', 146),
 ('supposed', 143),
 ('chapped', 142),
 ('sunscreen.', 141),
 ('seemed', 140),
 ('longg', 138),
 ('supple', 135),
 ('well!', 135),
 ('overall,', 135),
 ('smooth!', 134),
 ('weeks.', 133),
 ('basically', 133),
 ('guess', 132),
 ('issue', 129),
 ('all!', 128),
 ('peeling', 127),
 ('needs', 125),
 ('typically', 124),
 ('application.', 123),
 ('adding', 121),
 ('nighttime', 121),
 ('worried', 121),
 ('cotton', 121),
 ('keeping', 119),
 ('horrible', 118),
 ('redness,', 117),
 ('rubbing', 117),
 ('immediate', 116),
 ('suffer', 116),
 ('unless', 116),
 ('week,', 115),
 ('appearance', 114),
 ('feeling.', 113),
 ('struggle', 110),
 ('ahh', 109),
 ('struggled', 108),
 ('smelled', 108),
 ('applies', 107),
 ('flawless', 106),
 ('itll', 105),
 ('suggest', 104),
 ('irritating', 104),
 ('dull', 103),
 ('yttp', 102),
 ('better!', 101),
 ('look.', 99),
 ('effects', 98),
 ('applicator', 97),
 ('naturally', 96),
 ('followed', 96),
 ('redness.', 95),
 ('biggest', 93),
 ('stubborn', 93),
 ('sweet', 92),
 ('terrible', 90),
 ('across', 90),
 ('fully', 90),
 ('reallyy', 89),
 ('smoothly', 89),
 ('sunscreens', 89),
 ('affordable', 88),
 ('good!!', 87),
 ('gloss', 87),
 ('addition', 85),
 ('scarring', 84),
 ('setting', 83),
 ('stripping', 83),
 ('disappointed.', 82),
 ('dennis', 82),
 ('weeks,', 80),
 ('too,', 80),
 ('irritation.', 79),
 ('excess', 79),
 ('puffy', 79),
 ('reapply', 77),
 ('better,', 77),
 ('barrier', 77),
 ('amazingg', 76),
 ('initially', 76),
 ('originally', 76),
 ('essential', 76),
 ('smelling', 76),
 ('bottle,', 76),
 ('veryy', 75),
 ('beautifully', 75),
 ('appreciate', 75),
 ('ohh', 74),
 ('tanning', 74),
 ('obsessed.', 73),
 ('pills', 72),
 ('butter', 72),
 ('butt', 71),
 ('currently', 70),
 ('running', 70),
 ('happened', 69),
 ('following', 68),
 ('supple.', 68),
 ('cheeks.', 67),
 ('sunscreen,', 67),
 ('immediately.', 67),
 ('application,', 66),
 ('smoother,', 66),
 ('massage', 66),
 ('impressed.', 66),
 ('allergic', 65),
 ('effective.', 65),
 ('afford', 65),
 ('mirror', 65),
 ('feel.', 65),
 ('difficult', 65),
 ('happy.', 65),
 ('puffiness', 65),
 ('summer.', 64),
 ('worry', 64),
 ('buut', 64),
 ('wanna', 63),
 ('happen', 63),
 ('apply.', 63),
 ('squeeze', 63),
 ('stripped', 63),
 ('accutane', 62),
 ('pill', 61),
 ('difference!', 61),
 ('struggling', 61),
 ('peels', 61),
 ('free.', 61),
 ('stress', 61),
 ('annoying', 61),
 ('bottles', 61),
 ('smoothed', 60),
 ('penny.', 60),
 ('mess', 60),
 ('occasional', 59),
 ('superr', 59),
 ('smoother.', 59),
 ('adds', 59),
 ('pass', 59),
 ('dropper', 58),
 ('cheek', 57),
 ('process', 57),
 ('generally', 57),
 ('deeply', 57),
 ('fall', 57),
 ('yall', 57),
 ('amazingg.', 57),
 ('dollars', 56),
 ('bottom', 56),
 ('issues.', 56),
 ('appear', 56),
 ('obsessed!', 56),
 ('berry', 56),
 ('looved', 55),
 ('employee', 54),
 ('boost', 54),
 ('sk-ii', 53),
 ('beginning', 53),
 ('tanner', 53),
 ('lovee.', 53),
 ('jelly', 53),
 ('glossy', 53),
 ('omgg', 52),
 ('stuff,', 52),
 ('rubbed', 51),
 ('micellar', 51),
 ('gotta', 51),
 ('free,', 51),
 ('cc', 51),
 ('suggested', 50),
 ('verry', 50),
 ('hooked', 50),
 ('sitting', 50),
 ('supergoop', 50),
 ('telling', 50),
 ('agree', 49),
 ('loovvee', 49),
 ('middle', 49),
 ('excellent', 48),
 ('eventually', 48),
 ('smoothing', 48),
 ('current', 48),
 ('yummy', 48),
 ('itt', 47),
 ('loovee', 47),
 ('smaller.', 47),
 ('oiliness', 47),
 ('specifically', 47),
 ('hopefully', 47),
 ('huuge', 47),
 ('ball', 47),
 ('till', 46),
 ('off!', 46),
 ('smell!', 46),
 ('amazingg!', 46),
 ('penny', 45),
 ('whipped', 45),
 ('umm', 45),
 ('strawberry', 45),
 ('biossance', 45),
 ('lovee!', 44),
 ('smaller,', 44),
 ('apply,', 44),
 ('irritated.', 44),
 ('applying.', 44),
 ('mess.', 44),
 ('recommend!!', 44),
 ('dryness.', 44),
 ('needed.', 44),
 ('yellow', 44),
 ('disappoint.', 44),
 ('feet', 43),
 ('reeally', 43),
 ('(especially', 43),
 ('teeth', 43),
 ('school', 42),
 ('odd', 42),
 ('popped', 42),
 ('missing', 41),
 ('good!!!', 41),
 ('smallest', 41),
 ('carry', 41),
 ('impressed!', 41),
 ('buttery', 41),
 ('smooths', 40),
 ('call', 40),
 ('happy!', 40),
 ('hooked.', 40),
 ('scoop', 40),
 ('look,', 40),
 ('vanilla', 40),
 ('suffering', 39),
 ('soothing.', 39),
 ('soothes', 39),
 ('soon.', 39),
 ('effect.', 39),
 ('sunscreen!', 39),
 ('called', 39),
 ('collection', 39),
 ('dramatically', 38),
 ('supple,', 38),
 ('really,', 38),
 ('recommending', 38),
 ('woww', 38),
 ('miss', 38),
 ('sheet', 37),
 ('penny!', 37),
 ('happens', 37),
 ('cheeks,', 37),
 ('planning', 37),
 ('reaally', 37),
 ('possible', 37),
 ('irritates', 37),
 ('dryness,', 36),
 ('suffered', 36),
 ('bottle!', 36),
 ('effective,', 36),
 ('stuff!!', 36),
 ('thankfully', 36),
 ('issue.', 36),
 ('yess', 35),
 ('aand', 35),
 ('messy', 35),
 ('peel.', 35),
 ('practically', 35),
 ('willing', 35),
 ('inflammation', 35),
 ('massaging', 35),
 ('korres', 35),
 ('mm', 34),
 ('goodness', 34),
 ('disappear', 34),
 ('gross.', 34),
 ('deff', 34),
 ('pillow', 34),
 ('soothe', 34),
 ('cells', 34),
 ('looking.', 34),
 ('hmm', 34),
 ('difference,', 33),
 ('little.', 33),
 ('disappeared', 33),
 ('million', 33),
 ('different.', 33),
 ('loove!', 33),
 ('lovvee', 33),
 ('flawless.', 33),
 ('tree', 33),
 ('pull', 33),
 ('correcting', 33),
 ('feeling,', 33),
 ('ooh', 33),
 ('additional', 33),
 ('noo', 32),
 ('thrilled', 32),
 ('applying,', 32),
 ('upper', 32),
 ('bummed', 32),
 ('comment', 32),
 ('sheen', 32),
 ('happier', 32),
 ('steep', 31),
 ('summer,', 31),
 ('brightness', 31),
 ('needing', 31),
 ('combooily', 31),
 ('clogging', 31),
 ('terrible.', 31),
 ('effort', 31),
 ('specially', 31),
 ('lovve', 30),
 ('sorry', 30),
 ('effects.', 30),
 ('fee', 30),
 ('filled', 30),
 ('recommend,', 30),
 ('waayy', 30),
 ('teeny', 30),
 ('collagen', 30),
 ('impossible', 29),
 ('ehh', 29),
 ('loove.', 29),
 ('opportunity', 29),
 ('drastically', 29),
 ('weekly', 29),
 ('looves', 29),
 ('appears', 29),
 ('wonderfully', 29),
 ('gloss.', 29),
 ('disappoint!', 29),
 ('grabbed', 28),
 ('sheer', 28),
 ('thee', 28),
 ('reccomend', 28),
 ('necessarily', 28),
 ('loonngg', 28),
 ('sleep.', 28),
 ('refill', 28),
 ('superfood', 28),
 ('opposed', 28),
 ('bathroom', 28),
 ('hugee', 28),
 ('letting', 28),
 ('freeproduct', 28),
 ('feels.', 27),
 ('irritation,', 27),
 ('commented', 27),
 ('affordable.', 27),
 ('regardless', 27),
 ('andd', 27),
 ('looks.', 27),
 ('applied.', 27),
 ('goop', 27),
 ('pulling', 27),
 ('wallet', 27),
 ('popping', 26),
 ('disappointing', 26),
 ('affect', 26),
 ('disappears', 26),
 ('yess!', 26),
 ('week!', 26),
 ('press', 26),
 ('balls', 26),
 ('missed', 26),
 ('somerville', 26),
 ('ii', 26),
 ('sulwhasoo', 26),
 ('mattifying', 25),
 ('accidentally', 25),
 ('occasionally', 25),
 ('soo,', 25),
 ('lott', 25),
 ('soothed', 25),
 ('dryy', 25),
 ('professional', 25)]

Create correction mapping.

Code
correct_map_1 = {
    r"\bsoo\b": "so",
    r"\bsooo\b": "so",
    r"\bsoooo\b": "so",
    r"\bsooooo\b": "so",

    r"\blovee\b": "love",
    r"\bloove\b": "love",
    r"\bloong\b": "long",
    r"\bwayy\b": "way",
    r"\bgonna\b": "going to",
    r"\bwaay\b": "way",
    r"\bsuuper\b": "super",
    r"\byttp\b": "youth to the people",
    r"\byoull\b": "you will",
    r"\blongg\b": "long",
    r"\bnighttime\b": "night time",    
    r"\bitll\b": "it will",
    r"\breallyy\b": "really", 
    r"\bwanna\b": "want to", 
    r"\bohh\b": " ", 
    r"\bbuut\b": "but", 
    r"\byall\b": "you all",
    r"\bloovvee\b": "love",    
    r"\blovee\.": "love",      
    r"\bamazingg\b": "amazing",       
    r"\bwoww\b": "wow",     
    r"\bsuperr\b": "super",
    r"\baand\b": "and",
    r"\bamazingg\.": "amazing", 
    r"\bnoo\b": "no", 
    r"\bgotta\b": "got to", 
    r"\bdeff\b": "definitely", 
    r"\bveryy\b": "very", 
    r"\bandd\b": "and",
    r"\bomgg\b": "oh my god",
    r"\bloove\.": "love",
    r"\bloovee\b": "love",
    r"\blooved\b": "loved",
    r"\byess\b": "yes",
    r"\byall,": "you all",
    r"\bhuuge\b": "huge",
    r"\byess!": "yes",
    r"\blovvee\b": "love",
    r"\bwaayy\b": "way",
    r"\bloove!\b": "love",
    r"\bbutt\b": "but",
    r"\bloonngg\b": "long",
    r"\bitt\.": "it",
    r"\bforeverr\b": "forever",
    r"\bthicc\b": "thick",
    r"\bverry\b": "very",
    r"\bitt\b": "it",
    r"\blonng\b": "long",
    r"\beverr\b": "ever",
    r"\booh\b": " ", 
    r"\blovedd\b": "loved",
    r"\blott\b": "lot",
    r"\bsoo,": "so",
    r"\bmuchh\b": "much",
    r"\byess!!!": "yes",
    r"\bwoow\b": " ",
    r"\blooves\b": "loves",
    r"\byall\.": "you all",
    r"\bitt!": "it",
    r"\breaally\b": "really",
    r"\bamaazing\.": "amazing", 
    r"\bverryy\b": "very",
    r"\bhella\b": "very",
    r"\bamazingg!": "amazing", 
    r"\bloovvee\b": "love",

    # later added
    r"\bahh\b": " ", 
    r"\bumm\b": " ",     
    r"\blovee!": "love",
    r"\byess\b": "yes",
    r"\bhmm\b": " ", 
    r"\bcombooily\b": "combo oily", 
    r"\blovve\b": "love",   
    r"\behh\b": " ",  
    r"\bloonngg\b": "long",    
    r"\bhugee\b": "huge",  
    r"\bfreeproduct\b": "free product",    
    r"\bandd\b": "and",
    r"\byess!": "yes",
    r"\blott\b": "lot",
}

review_df_all_deduplicate['review_text_cleaned'] = (
    review_df_all_deduplicate['review_text_cleaned']
    .replace(correct_map_1,regex=True)
)
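The dict-of-regex replacement can be sanity-checked on a toy Series; the `\b` word boundaries ensure only whole tokens are corrected. A minimal sketch with hypothetical review fragments:

```python
import pandas as pd

# a tiny sample of misspelled review fragments (hypothetical)
toy = pd.Series(["i soo lovee this serum", "itt is huuge but worth it"])

toy_map = {
    r"\bsoo\b": "so",
    r"\blovee\b": "love",
    r"\bitt\b": "it",
    r"\bhuuge\b": "huge",
}

# Series.replace with regex=True applies each pattern as a regex substitution
cleaned = toy.replace(toy_map, regex=True)
print(cleaned.tolist())
# ['i so love this serum', 'it is huge but worth it']
```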

4.8 Punctuation

Remove punctuation.

Code
pattern = "[" + re.escape(string.punctuation) + "]" 
review_df_all_deduplicate["review_text_cleaned"] = (
    review_df_all_deduplicate["review_text_cleaned"].str.replace(pattern, " ", regex=True)
)

punc_lis_remain = []

for text in review_df_all_deduplicate['review_text_cleaned']:
    punc_lis_remain.extend(re.findall(r"[^\w\s]", text))

punc_lis_remain = sorted(set(punc_lis_remain), key=lambda ch: ord(ch)) 
print('Characters (only emoji) left in the reviews: ', '\n', punc_lis_remain)
Characters (only emoji) left in the reviews:  
 ['☀', '☁', '★', '☆', '☝', '☬', '☹', '☺', '♀', '♂', '♡', '♥', '♻', '♾', '⚗', '⚜', '⚠', '⚡', '⚪', '⛅', '⛑', '✅', '✈', '✋', '✌', '✓', '✔', '✖', '✨', '❄', '❌', '❕', '❗', '❣', '❤', '➕', '➖', '➡', '➾', '➿', '⬇', '⭐', '�', '🅶', '🅷', '🅼', '🅾', '🆂', '🆃', '🆄', '🆈', '🆓', '🆙', '🌊', '🌙', '🌞', '🌟', '🌫', '🌱', '🌸', '🌹', '🌺', '🌿', '🍂', '🍃', '🍅', '🍉', '🍊', '🍋', '🍌', '🍒', '🍓', '🍞', '🍬', '🍯', '🎀', '🎄', '🎉', '🏆', '🏻', '🏼', '🏽', '🏾', '🐐', '🐘', '🐝', '🐣', '👀', '👁', '👄', '👋', '👌', '👍', '👎', '👏', '👣', '👩', '👶', '💀', '💄', '💅', '💋', '💓', '💔', '💕', '💖', '💗', '💘', '💙', '💚', '💛', '💜', '💞', '💡', '💥', '💦', '💨', '💫', '💯', '📌', '📍', '📦', '🔥', '🔵', '🖤', '😀', '😁', '😂', '😃', '😄', '😅', '😆', '😉', '😊', '😌', '😍', '😏', '😒', '😓', '😔', '😕', '😖', '😘', '😚', '😝', '😞', '😠', '😢', '😩', '😪', '😫', '😬', '😭', '😮', '😲', '😳', '😶', '😻', '🙁', '🙂', '🙃', '🙄', '🙌', '🙏', '🛍', '🛑', '🛒', '🤌', '🤍', '🤎', '🤐', '🤓', '🤔', '🤗', '🤞', '🤡', '🤢', '🤣', '🤤', '🤦', '🤨', '🤩', '🤪', '🤭', '🤮', '🤯', '🤷', '🥑', '🥰', '🥲', '🥳', '🥴', '🥶', '🥹', '🥺', '🦝', '🦲', '🧖', '🧡', '🧴', '🧼', '🧿', '🩸', '🪄', '🫠', '🫣', '🫰', '🫶']
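Emoji survive this pass because `string.punctuation` covers ASCII punctuation only; a minimal reproduction of the behavior:

```python
import re
import string

# the same pattern as above: a character class of all ASCII punctuation
pattern = "[" + re.escape(string.punctuation) + "]"

text = "love it!!! 10/10 ✨"
cleaned = re.sub(pattern, " ", text)

# ASCII punctuation is gone, but the emoji is neither \w, \s, nor ASCII punctuation
leftover = re.findall(r"[^\w\s]", cleaned)
print(leftover)  # ['✨']
```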

4.9 Random Words

Finally, we notice random, meaningless tokens of extreme length, for example:

Code
review_df_all_deduplicate.loc[80989, 'review_text_cleaned']
'yea i agree yes same mhm jtssngdbkgiiycycitctiheckljjioo'

We apply a two-stage heuristic: first, flag any review containing a token longer than 30 characters; then, within those flagged reviews, remove every token of 23 or more characters (tokens containing the ✅ or ❌ emoji are kept, as they still carry sentiment).

Code
def detect_gibberish(w, max_len=30):
    # flag tokens longer than max_len characters
    return len(w) > max_len

def gibberish_text(text):
    words = text.split()
    return any(detect_gibberish(w) for w in words)


mask = review_df_all_deduplicate['review_text_cleaned'].apply(gibberish_text)

df_process = review_df_all_deduplicate.loc[mask, 'review_text_cleaned']
df_process = pd.DataFrame(df_process)

def drop_long_word(text, threshold = 23):
    emoji = {"✅", "❌"} 
    tokens = text.split()
    kept = []

    for tok in tokens:
        if len(tok) >= threshold:
            if any(e in tok for e in emoji):
                kept.append(tok)

        else:
            kept.append(tok)

    return " ".join(kept)


df_process['review_text_cleaned'] = df_process['review_text_cleaned'].apply(drop_long_word)
review_df_all_deduplicate.loc[df_process.index, 'review_text_cleaned'] = df_process['review_text_cleaned']
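As a quick check, a standalone copy of the token-length filter (omitting the emoji exception for brevity) applied to the example review shown above:

```python
# standalone re-implementation of the long-token filter, for illustration only
def drop_long_word_demo(text, threshold=23):
    kept = [tok for tok in text.split() if len(tok) < threshold]
    return " ".join(kept)

sample = "yea i agree yes same mhm jtssngdbkgiiycycitctiheckljjioo"
print(drop_long_word_demo(sample))
# 'yea i agree yes same mhm'
```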

This concludes our text cleaning pipeline. While achieving a perfectly clean corpus is impractical given the large volume of unstructured text, this process has substantially reduced noise and standardized the data.

Code
review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].str.replace(r"\s+", " ", regex=True)
review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].str.strip()
Code
# some text entries are empty after cleaning, we remove them
mask_empty = review_df_all_deduplicate['review_text_cleaned'].str.len() == 0
review_df_all_deduplicate = review_df_all_deduplicate.loc[~mask_empty]

Only reviews submitted within the last four years will be clustered; comments older than that are too outdated to inform current product development and are therefore treated as noise.

Code
review_df_all_deduplicate = review_df_all_deduplicate[review_df_all_deduplicate['submission_year'] >= 2020]
review_df_all_deduplicate.to_csv('dataset/review_df_all_deduplicate_processed.csv', index=True)

5. Sentence Embedding

We use Sentence-BERT (SBERT) to generate sentence embeddings. Unlike traditional keyword-based methods (e.g. TF-IDF) or static embeddings (e.g. fastText), SBERT captures deep contextual meaning: it understands that “not good” is the opposite of “good,” whereas keyword-based models may treat them similarly. This makes it the most suitable model for precisely converting sentences into vectors given our current computational resources and data.
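The negation point can be made concrete with a quick TF-IDF contrast (illustrative only, not part of the SBERT pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["this cream is good", "this cream is not good"]
tfidf = TfidfVectorizer().fit_transform(docs)

# the two opposite opinions share almost all tokens,
# so a keyword-based model scores them as near-duplicates
sim = cosine_similarity(tfidf)[0, 1]
print(round(sim, 2))
```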

The implementation details are omitted for brevity. Since this model is a transformer-based model that benefits significantly from GPU acceleration, the full execution script is hosted on Google Colab and can be found here.

We directly import the transformed vectors.

Code
df_sbert = pd.read_csv("dataset/sbert_embeddings_2.csv", index_col=0)

6. Clustering

Given our dataset size of over 800,000 reviews, running t-SNE to first visualize the clustering structure is computationally prohibitive, so we use UMAP instead, since:

  1. UMAP is much faster especially on large datasets.

  2. UMAP better maintains the relationships between clusters, not just points within clusters, offering a more accurate representation of the high-dimensional data’s overall shape.

6.1 UMAP

We first draw a set of UMAP plots with different combinations of n_neighbors and min_dist to avoid instability caused by parameters. Candidate values are n_neighbors = 15, 30, 45, 60, and min_dist = 0.1.

Note that we set init = ‘pca’ and metric = ‘cosine’.

n_neighbors = 15, min_dist = 0.1.

Code
%%time
umap = UMAP(n_neighbors=15, min_dist = 0.1, n_components=2, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap = umap.fit_transform(df_sbert)

n_neighbors = 30, min_dist = 0.1.

Code
%%time
umap = UMAP(n_neighbors=30, min_dist = 0.1, n_components=2, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap_1 = umap.fit_transform(df_sbert)

n_neighbors = 45, min_dist = 0.1.

Code
%%time
umap = UMAP(n_neighbors=45, min_dist = 0.1, n_components=2, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap_2 = umap.fit_transform(df_sbert)

n_neighbors = 60, min_dist = 0.1.

Code
%%time
umap = UMAP(n_neighbors=60, min_dist = 0.1, n_components=2, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap_3 = umap.fit_transform(df_sbert)
Code
fig, axes = plt.subplots(2,2, figsize = (16, 12))

df_umap = pd.DataFrame(mat_umap)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2, ax = axes[0,0])
axes[0,0].set_xlabel("x_projected")
axes[0,0].set_ylabel("y_projected")
axes[0,0].set_title('n_neighbors: 15, min_dist = 0.1')


df_umap = pd.DataFrame(mat_umap_1)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2, ax = axes[0,1])
axes[0,1].set_xlabel("x_projected")
axes[0,1].set_ylabel("y_projected")
axes[0,1].set_title('n_neighbors: 30, min_dist = 0.1')

df_umap = pd.DataFrame(mat_umap_2)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2, ax = axes[1,0])
axes[1,0].set_xlabel("x_projected")
axes[1,0].set_ylabel("y_projected")
axes[1,0].set_title('n_neighbors: 45, min_dist = 0.1')

df_umap = pd.DataFrame(mat_umap_3)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2, ax = axes[1,1])
axes[1,1].set_xlabel("x_projected")
axes[1,1].set_ylabel("y_projected")
axes[1,1].set_title('n_neighbors: 60, min_dist = 0.1')

fig.suptitle('UMAP', fontsize = 24)

plt.tight_layout()
plt.show()

We select n_neighbors = 45, min_dist = 0.1 as our representative UMAP projection.

Code
fig, ax = plt.subplots()
df_umap = pd.DataFrame(mat_umap_2)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2)
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
ax.set_title('n_neighbors: 45, min_dist = 0.1')
plt.show()

We see one large main cluster surrounded by three dense smaller clusters, and they are not well separated: two of the smaller clusters are connected to the main one by bridge points. Considering also that the cluster shapes are non-convex, algorithms such as k-means and hierarchical clustering are not suitable for our case.
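To illustrate why non-convex shapes rule out k-means, a small demonstration on the classic two-moons data (synthetic, not our reviews):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# two interleaving crescents: non-convex clusters
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# k-means draws a straight boundary that cuts both crescents,
# so agreement with the true grouping is poor
ari = adjusted_rand_score(y, km.labels_)
print(ari)
```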

6.1.1 Reduced Dimension

Before implementing the clustering algorithm, a dimensionality reduction step is needed: the 384-dimensional SBERT embeddings suffer from the ‘curse of dimensionality’, in which pairwise distances concentrate and all points become nearly equidistant. By reducing to a lower-dimensional latent space, UMAP concentrates the variance and re-establishes a meaningful distance metric, making it easier for algorithms to identify distinct boundaries.
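The concentration effect is easy to verify numerically: the relative spread of pairwise distances between random points collapses as the dimension grows. A sketch with synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_spread(dim, n=200):
    # pairwise Euclidean distances among n random points in `dim` dimensions
    X = rng.standard_normal((n, dim))
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d = np.sqrt(np.clip(d2, 0, None))
    d = d[np.triu_indices(n, k=1)]
    # std/mean: how distinguishable near and far neighbours are
    return d.std() / d.mean()

print(relative_spread(2), relative_spread(384))
# the 384-dimensional spread is far smaller: distances become nearly equal
```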

We use trustworthiness to evaluate the optimal reduced dimension. The n_neighbors parameter used in the trustworthiness calculation needs to be tuned; four values (15, 30, 50, 100) will be tested. A small n_neighbors tends to preserve local clustering structure, while a larger value preserves the global structure and avoids overfitting.
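scikit-learn exposes this metric directly. A minimal sketch of how trustworthiness scores an embedding against its high-dimensional source (synthetic data and illustrative parameters; PCA stands in for UMAP here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))

# any low-dimensional embedding can be scored
X_low = PCA(n_components=2, random_state=0).fit_transform(X)

# fraction of each point's low-dimensional neighbours that were
# also neighbours in the original space (1.0 = perfectly preserved)
score = trustworthiness(X, X_low, n_neighbors=15)
print(score)
```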

Code
df_sbert['rating'] = review_df_all_deduplicate['rating']

sample_ratio = 0.3

# stratified split keeps the rating distribution of the full data in the sample
_, df_sbert_sample = train_test_split(df_sbert, test_size=sample_ratio, stratify=df_sbert['rating'], 
                                random_state=42)

# drop the helper column so the rating itself does not enter the embedding space
df_sbert_sample = df_sbert_sample.drop(columns='rating')
Code
umap = UMAP(n_neighbors=45, min_dist = 0.1, n_components=2, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap_cluster_sample_visual = umap.fit_transform(df_sbert_sample)

fig, ax = plt.subplots()
df_umap = pd.DataFrame(mat_umap_cluster_sample_visual)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2)
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
ax.set_title(f'Sampled Texts (sample ratio {sample_ratio})')
plt.show()

PCA

Code
mat_normalized = normalize(df_sbert_sample.to_numpy(), norm='l2')
pca = PCA(n_components=200, random_state= 42).fit(mat_normalized)
cumvar = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(6,4))
plt.plot(range(1, len(cumvar)+1), cumvar, marker="o")
plt.axhline(0.9, color="r", ls="--", lw=1) 
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.ylim(0, 1.01)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('principal component number containing 90% variance: ',np.argmin(cumvar <= 0.9) + 1)

principal component number containing 90% variance:  140
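The `np.argmin(cumvar <= 0.9) + 1` idiom picks the first component count whose cumulative variance crosses the 90% line; a toy check:

```python
import numpy as np

cumvar = np.array([0.50, 0.80, 0.88, 0.92, 0.97])

# argmin on a boolean array returns the index of the first False,
# i.e. the first position where cumulative variance exceeds 0.9
k = np.argmin(cumvar <= 0.9) + 1
print(k)  # 4
```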
Code
umap = UMAP(n_neighbors=45, min_dist = 0.1, n_components=140, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap_cluster = umap.fit_transform(df_sbert_sample)
Code
umap = UMAP(n_neighbors=45, min_dist = 0.1, n_components=2, init='pca', metric='cosine', n_jobs=-1, verbose=False)
mat_umap_cluster_visual = umap.fit_transform(mat_umap_cluster)

fig, ax = plt.subplots()
df_umap = pd.DataFrame(mat_umap_cluster_visual)
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], alpha = 0.2)
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
ax.set_title('Dimension-reduced Texts')
plt.show()

6.2 HDBSCAN

Code
hdb = hdbscan.HDBSCAN(min_cluster_size= 200, min_samples=5, core_dist_n_jobs=-1)
hdb.fit(mat_umap_cluster)
HDBSCAN(core_dist_n_jobs=-1, min_cluster_size=200, min_samples=5)
Code
labels = np.unique(hdb.labels_)
df_umap['cluster_label'] = hdb.labels_

for i in range(len(labels)):
    fig, ax = plt.subplots(figsize = (10, 8))
    sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], color = 'gray', alpha = 0.2, ax = ax)


    mask = df_umap['cluster_label'] == labels[i]
    sns.scatterplot(x=df_umap[mask].iloc[:,0], y=df_umap[mask].iloc[:,1], color = 'red', alpha = 0.5, ax = ax)
    ax.set_title(f'cluster label: {labels[i]}')
    plt.savefig(f'./pic/hdb_label_{labels[i]}.png')
    plt.close()

Code
df_umap['cluster_label'] = hdb.labels_

def show_pic(labels, ax):
    # gray background of all points, with the selected cluster(s) highlighted in red
    sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], color='gray', alpha=0.2, ax=ax)

    mask = df_umap['cluster_label'].isin(labels)
    sns.scatterplot(x=df_umap[mask].iloc[:,0], y=df_umap[mask].iloc[:,1], color='red', alpha=0.5, ax=ax)
    ax.set_title(f'cluster label: {labels}')
    ax.set_xlabel('')
    ax.set_ylabel('')
Code
fig, axes = plt.subplots(1,5, figsize = (30, 5))

labels = [0]
show_pic(labels, axes[0])

labels = [17]
show_pic(labels, axes[1])

labels = [6, 36,70, 71, 72]
show_pic(labels, axes[2])

labels = [31]
show_pic(labels, axes[3])

labels = [29, 60, 73, 74, 75, 76, 77]
show_pic(labels, axes[4])

fig, axes = plt.subplots(1,5, figsize = (30, 5))

labels = [48, 53]
show_pic(labels, axes[0])

labels = [5]
show_pic(labels, axes[1])

labels = [2]
show_pic(labels, axes[2])

labels = [8]
show_pic(labels, axes[3])

labels = [16]
show_pic(labels, axes[4])

fig, axes = plt.subplots(1,5, figsize = (30, 5))

labels = [26]
show_pic(labels, axes[0])

labels = [34]
show_pic(labels, axes[1])

labels = [32, 33, 38, 41, 43, 44, 55, 58, 68, 69]
show_pic(labels, axes[2])

labels = [32, 35, 39, 40]
show_pic(labels, axes[3])

labels = [12, 13, 14, 18, 21, 22, 23, 24, 27, 28, 42, 45, 46, 47, 49, 50, 52, 56, 57, 59, 61, 
          62, 63, 64, 65, 66, 67, 68
          ]
show_pic(labels, axes[4])


plt.tight_layout()

6.3 Clearing Outliers

Code
df_umap_cluster = pd.DataFrame(mat_umap_cluster)
df_umap_cluster['cluster_label'] = hdb.labels_
Code
labels = [0]
a = df_umap_cluster[df_umap_cluster['cluster_label'].isin(labels)]

pdist(a, metric = 'euclidean')  
array([0.55560277, 0.71450533, 0.12508089, ..., 0.76212363, 0.72282339,
       0.543884  ])
Code
sorted(pdist(a, metric = 'euclidean'), reverse = True)
[np.float64(1.8881247626060074),
 np.float64(1.8859719449657026),
 np.float64(1.885146710247753),
 np.float64(1.8795802876844763),
 np.float64(1.8777406393403777),
 np.float64(1.8738276270798984),
 np.float64(1.8720288374026757),
 np.float64(1.8656323970331878),
 np.float64(1.8638781158775635),
 np.float64(1.8627787087359635),
 np.float64(1.8605047746611825),
 np.float64(1.858922413915954),
 np.float64(1.8561808162816469),
 np.float64(1.8553506703491531),
 np.float64(1.855289173418172),
 np.float64(1.853969477882373),
 np.float64(1.8529455554405947),
 np.float64(1.8523144184700135),
 np.float64(1.8517244381410976),
 np.float64(1.8467450252406425),
 np.float64(1.844566942919656),
 np.float64(1.844298509405662),
 np.float64(1.8413876047978281),
 np.float64(1.8406001028764114),
 np.float64(1.8374354750709214),
 np.float64(1.837359980815486),
 np.float64(1.836649452827562),
 np.float64(1.8364117981788073),
 np.float64(1.8355552560477661),
 np.float64(1.8338720883491204),
 np.float64(1.833624486488041),
 np.float64(1.8320685135989512),
 np.float64(1.8289829134118103),
 np.float64(1.8289489637150407),
 np.float64(1.826951470376184),
 np.float64(1.826239294365822),
 np.float64(1.8260474225769847),
 np.float64(1.8246401625729372),
 np.float64(1.8232114106699797),
 np.float64(1.8214908360522022),
 np.float64(1.8214040332249688),
 np.float64(1.821296364688217),
 np.float64(1.821209536471448),
 np.float64(1.8206157735699613),
 np.float64(1.820536584176564),
 np.float64(1.8205237744044775),
 np.float64(1.8203094586343265),
 np.float64(1.8198840011319024),
 np.float64(1.8198264182029933),
 np.float64(1.8190601002838773),
 np.float64(1.8181147072624948),
 np.float64(1.8178075041086421),
 np.float64(1.8174551176853926),
 np.float64(1.8170876494591786),
 np.float64(1.8165244943598728),
 np.float64(1.8161363514211746),
 np.float64(1.8136182711790254),
 np.float64(1.8135425604196702),
 np.float64(1.8123147285722425),
 np.float64(1.8119604803996148),
 np.float64(1.8110636699013547),
 np.float64(1.810486677625354),
 np.float64(1.809852361378442),
 np.float64(1.8096892080864306),
 np.float64(1.8090903635359186),
 np.float64(1.8090570696434558),
 np.float64(1.80887038073902),
 np.float64(1.8086953997622257),
 np.float64(1.8084826453322616),
 np.float64(1.8080789129228465),
 np.float64(1.8078906536866333),
 np.float64(1.8076067104704063),
 np.float64(1.807409286837978),
 np.float64(1.8073308570185644),
 np.float64(1.806991792887066),
 np.float64(1.8064782806584512),
 np.float64(1.8060300761335915),
 np.float64(1.8056374874917678),
 np.float64(1.8048764792610192),
 np.float64(1.8046552735626378),
 np.float64(1.8046163438964584),
 np.float64(1.804612535748529),
 np.float64(1.804458968143439),
 np.float64(1.8044450552000866),
 np.float64(1.8036051636694588),
 np.float64(1.8034434025376942),
 np.float64(1.803440941300555),
 np.float64(1.8033327273634896),
 np.float64(1.8032115539374096),
 np.float64(1.8030694517261683),
 np.float64(1.80272532123528),
 np.float64(1.8026466140136554),
 np.float64(1.8019424162884214),
 np.float64(1.8012948617903162),
 np.float64(1.8010548750496296),
 np.float64(1.800780180687388),
 np.float64(1.8007042673377909),
 np.float64(1.8006773218424894),
 np.float64(1.8004262653904408),
 np.float64(1.799053965045905),
 np.float64(1.798902105812086),
 np.float64(1.798735078163306),
 np.float64(1.797832744849076),
 np.float64(1.7973909851692897),
 np.float64(1.7973143747882265),
 np.float64(1.7969374491041559),
 np.float64(1.7968473721165716),
 np.float64(1.7967415785715213),
 np.float64(1.7967208826462784),
 np.float64(1.7964504767471126),
 np.float64(1.7964499213379108),
 np.float64(1.796143738488198),
 np.float64(1.795799379009296),
 np.float64(1.794974684811348),
 np.float64(1.794883136254497),
 np.float64(1.7948221382847827),
 np.float64(1.794719325075523),
 np.float64(1.794502620446147),
 np.float64(1.7942593210432731),
 np.float64(1.7940977044603306),
 np.float64(1.7940929937712486),
 np.float64(1.7940435538392765),
 np.float64(1.793894023819767),
 np.float64(1.7937821251919241),
 np.float64(1.7934490944995876),
 np.float64(1.7932370105962347),
 np.float64(1.7929122405255677),
 np.float64(1.792864616380155),
 np.float64(1.7928080657133791),
 np.float64(1.7925614624090436),
 np.float64(1.7924538246902446),
 np.float64(1.7923721879715617),
 np.float64(1.7923085819059392),
 np.float64(1.792269701984597),
 np.float64(1.7915312594474326),
 np.float64(1.790978139561444),
 np.float64(1.7905940085987462),
 np.float64(1.7904734582177781),
 np.float64(1.7902629459699244),
 np.float64(1.790215147507311),
 np.float64(1.789848535830746),
 np.float64(1.789655276963379),
 np.float64(1.78963902024689),
 np.float64(1.789433799318617),
 np.float64(1.788927825052337),
 np.float64(1.7888145485407865),
 np.float64(1.7886501633361997),
 np.float64(1.7885930016307996),
 np.float64(1.7885579491646826),
 np.float64(1.788488949670734),
 np.float64(1.7884585397072634),
 np.float64(1.7884568404345398),
 np.float64(1.7884371190466166),
 np.float64(1.7883467312873955),
 np.float64(1.7880249957375034),
 np.float64(1.7879338200561172),
 np.float64(1.7878487450567737),
 np.float64(1.7877454151418013),
 np.float64(1.7876784798535859),
 np.float64(1.7875321665247526),
 np.float64(1.787366669142134),
 np.float64(1.7872148250351318),
 np.float64(1.7868380731138396),
 np.float64(1.7866824511428137),
 np.float64(1.786527840623158),
 np.float64(1.7859910099803193),
 np.float64(1.7858977973004726),
 np.float64(1.7855374779148638),
 np.float64(1.7853884671714568),
 np.float64(1.7852269320565408),
 np.float64(1.7851757089563842),
 np.float64(1.7851685705891642),
 np.float64(1.7850407393856242),
 np.float64(1.784607826921873),
 np.float64(1.7844841467403367),
 np.float64(1.784441895556143),
 np.float64(1.7842895978965443),
 np.float64(1.7842020772946585),
 np.float64(1.7839131814134102),
 np.float64(1.7835531274192586),
 np.float64(1.783112180382199),
 np.float64(1.7829854835315915),
 np.float64(1.78277413070008),
 np.float64(1.7827312009506429),
 np.float64(1.782577033576253),
 np.float64(1.7824372153653971),
 np.float64(1.7823820933253383),
 np.float64(1.7819268865243159),
 np.float64(1.7815013061253533),
 np.float64(1.7812956322632048),
 np.float64(1.7811251893314597),
 np.float64(1.7810883238893942),
 np.float64(1.7807076757216866),
 np.float64(1.7804876820035724),
 np.float64(1.7804227323957782),
 np.float64(1.7803995856003387),
 np.float64(1.7803498656209715),
 np.float64(1.780041432193704),
 np.float64(1.7798806102328626),
 np.float64(1.7798138854058243),
 np.float64(1.7797988925935393),
 np.float64(1.7797153771028271),
 np.float64(1.7796920680043218),
 np.float64(1.7796828936216662),
 np.float64(1.7795725385516254),
 np.float64(1.779569934906558),
 np.float64(1.77933787694825),
 np.float64(1.7792954757576929),
 np.float64(1.779156840936127),
 np.float64(1.7790296284055902),
 np.float64(1.7789768289418084),
 np.float64(1.7789706418101283),
 np.float64(1.7789318230017748),
 np.float64(1.7788552048074615),
 np.float64(1.778769898928156),
 np.float64(1.7787259348343143),
 np.float64(1.7781518814872082),
 np.float64(1.7780663161591799),
 np.float64(1.777906335719776),
 np.float64(1.7776429239551188),
 np.float64(1.7776203748447512),
 np.float64(1.7774681677157034),
 np.float64(1.7774142469642462),
 np.float64(1.7772636510945952),
 np.float64(1.7768798357273976),
 np.float64(1.7768700565455298),
 np.float64(1.7768519815802055),
 np.float64(1.7767917740875356),
 np.float64(1.7767764682533982),
 np.float64(1.7767674276018897),
 np.float64(1.776734083923032),
 np.float64(1.7764570173025196),
 np.float64(1.7764008660024027),
 np.float64(1.7763733961936052),
 np.float64(1.7762994464510407),
 np.float64(1.7760840591801441),
 np.float64(1.776082269976266),
 np.float64(1.7759259731074566),
 np.float64(1.775839258627002),
 np.float64(1.7758327738636865),
 np.float64(1.775815791274354),
 np.float64(1.7757466328977856),
 np.float64(1.775738157772565),
 np.float64(1.7756870833933638),
 np.float64(1.7756695727547818),
 np.float64(1.775581326484053),
 np.float64(1.775472527480242),
 np.float64(1.7754635211980172),
 np.float64(1.7754478177729287),
 np.float64(1.7753573759829446),
 np.float64(1.7752868954896253),
 np.float64(1.7751480661175827),
 np.float64(1.7751171477863896),
 np.float64(1.7750883674254463),
 np.float64(1.775082513531007),
 np.float64(1.774939432703387),
 np.float64(1.7749289014732033),
 np.float64(1.7747876652477892),
 np.float64(1.774753675898499),
 np.float64(1.7746538268343455),
 np.float64(1.7745078200421636),
 np.float64(1.7744620785938563),
 np.float64(1.7743883700152903),
 np.float64(1.774385117547195),
 np.float64(1.7743431618504253),
 np.float64(1.7742984731454834),
 np.float64(1.774264187013569),
 np.float64(1.774114464591442),
 np.float64(1.7740986839162933),
 np.float64(1.7739208860574223),
 np.float64(1.7738296135225815),
 np.float64(1.7738267099580063),
 np.float64(1.7736618423906774),
 np.float64(1.7735157629621134),
 np.float64(1.7735070850694594),
 np.float64(1.7734603289911401),
 np.float64(1.7734336121957248),
 np.float64(1.77329548085782),
 np.float64(1.7732281912730725),
 np.float64(1.7732098293820082),
 np.float64(1.7730714333706385),
 np.float64(1.772995218867352),
 np.float64(1.7729826696056343),
 np.float64(1.7729305006760632),
 np.float64(1.772839015397466),
 np.float64(1.7724504136754857),
 np.float64(1.7724265213400667),
 np.float64(1.7720461113391481),
 np.float64(1.7719253705212568),
 np.float64(1.7718819992301271),
 np.float64(1.7718092253657707),
 np.float64(1.7715429621117247),
 np.float64(1.7715425575927075),
 np.float64(1.771437313136572),
 np.float64(1.7713889967003014),
 np.float64(1.7709962026822879),
 np.float64(1.7709500113495509),
 np.float64(1.7708107943122176),
 np.float64(1.7707898348502773),
 np.float64(1.7707472278166412),
 np.float64(1.770366001628334),
 np.float64(1.7702303365152219),
 np.float64(1.770200255355397),
 np.float64(1.7701497019729844),
 np.float64(1.7699536942190652),
 np.float64(1.769935948950993),
 np.float64(1.7695801213126034),
 np.float64(1.769539336208651),
 np.float64(1.769506369910049),
 np.float64(1.7694646881740534),
 np.float64(1.7692353236185534),
 np.float64(1.7692048473919986),
 np.float64(1.769090633350991),
 np.float64(1.7689001274985396),
 np.float64(1.768887999588014),
 np.float64(1.7688120202157742),
 np.float64(1.7685283499417805),
 np.float64(1.768319102102896),
 np.float64(1.768285938522774),
 np.float64(1.7682528691449366),
 np.float64(1.7681951765567567),
 np.float64(1.7681473275141708),
 np.float64(1.7681037329434057),
 np.float64(1.7680611456128883),
 np.float64(1.7678088398798584),
 np.float64(1.7676869186857775),
 np.float64(1.7676663405863915),
 np.float64(1.7676472205393328),
 np.float64(1.767595585522742),
 np.float64(1.7675924845284343),
 np.float64(1.7675208385010954),
 np.float64(1.7675136526646162),
 np.float64(1.7673388838808708),
 np.float64(1.7672933372187651),
 np.float64(1.7671966675904247),
 np.float64(1.76715421835354),
 np.float64(1.7671247990941141),
 np.float64(1.7670521147218254),
 np.float64(1.7669937666516977),
 np.float64(1.7669686840892882),
 np.float64(1.7668311989723726),
 np.float64(1.766795220736823),
 np.float64(1.7667562505018508),
 np.float64(1.7666273944502762),
 np.float64(1.7666183195065228),
 np.float64(1.766430045637103),
 np.float64(1.7659571737339976),
 np.float64(1.765943332396337),
 np.float64(1.7659297304717818),
 np.float64(1.7658577037490202),
 np.float64(1.7657813954775436),
 np.float64(1.7656834395817635),
 np.float64(1.7654892963025732),
 np.float64(1.7653595790879282),
 np.float64(1.7653210889917537),
 np.float64(1.7652525475957423),
 np.float64(1.7652034347668664),
 np.float64(1.764996904912697),
 np.float64(1.7649811339842965),
 np.float64(1.7649512577044952),
 np.float64(1.7649407027402333),
 np.float64(1.7649119080677236),
 np.float64(1.7648803458956241),
 np.float64(1.7647607384937802),
 ...
 np.float64(1.7347916131518377),
 ...]
Code
# pairwise Euclidean distances between the sampled embedding rows
from scipy.spatial.distance import pdist, squareform

dist = squareform(pdist(a, metric='euclidean'))
dist = pd.DataFrame(dist, index=a.index)

# keep only rows whose distance to every other row is at most 1.2
# (the dense core of the sample), then drop the trailing label column
mask = np.sum(dist > 1.2, axis=1)
a = a.loc[mask[mask == 0].index].iloc[:, :-1]
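The thresholding logic above can be sanity-checked on a toy example; the 1-D points and the 1.2 cutoff here are illustrative, not from the original data. Only points within the threshold of every other point survive the filter:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# four 1-D points; only the middle point (index 2) is within 1.2 of all others
toy = pd.DataFrame({'x': [0.0, 0.5, 1.0, 2.0]})

d = pd.DataFrame(squareform(pdist(toy, metric='euclidean')), index=toy.index)

# per row: how many other points lie farther than the threshold
n_far = np.sum(d > 1.2, axis=1)

# keep rows with no far neighbors at all
core = toy.loc[n_far[n_far == 0].index]
print(core.index.tolist())  # → [2]
```

Note that the filter is strict: a point with even one far neighbor is dropped, so it retains a mutually close core rather than removing isolated outliers one by one.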
Code
a
0 1 2 3 4 5 6 7 8 9 ... 130 131 132 133 134 135 136 137 138 139
69 8.485096 7.114382 6.052022 4.440410 5.365368 2.744745 3.035398 3.475390 4.038151 5.120708 ... 4.115559 5.099222 4.003700 5.015654 5.410122 4.535675 5.180325 4.447740 5.189550 5.240172
125 8.633148 7.221770 6.291441 4.481538 5.380897 2.719591 2.967602 3.418453 4.054897 5.139610 ... 4.113472 5.103001 4.003533 5.021810 5.415418 4.535436 5.179959 4.439208 5.181566 5.243672
345 8.722537 7.277019 6.338087 4.515296 5.341307 2.842889 2.998750 3.544150 4.092784 5.112059 ... 4.112357 5.098690 4.008207 5.030266 5.412497 4.532458 5.174282 4.441049 5.180226 5.244901
368 8.684679 7.231976 6.243512 4.557955 5.320023 2.733992 2.980900 3.427756 4.085420 5.126325 ... 4.116591 5.099125 4.003501 5.013340 5.414765 4.530433 5.180703 4.442517 5.178005 5.239195
511 8.643785 7.231339 6.270880 4.412459 5.425356 2.801544 2.984237 3.514085 4.038185 5.105775 ... 4.112610 5.102696 4.007766 5.029111 5.411676 4.531614 5.175664 4.440275 5.186156 5.245007
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
157826 8.686310 7.251465 6.304925 4.545490 5.340065 2.689251 2.945246 3.378734 4.075504 5.142323 ... 4.115281 5.102278 4.001506 5.013481 5.417091 4.532234 5.182118 4.438879 5.176321 5.240269
157863 8.616748 7.235164 6.080451 4.242522 5.379309 2.862253 3.011707 3.668183 4.040038 5.042645 ... 4.112945 5.107568 4.009701 5.043721 5.402246 4.537492 5.166680 4.443984 5.189363 5.250729
157878 8.600643 7.200023 6.138275 4.546696 5.305594 2.731719 3.015780 3.458200 4.082753 5.128393 ... 4.116413 5.096345 4.002311 5.007275 5.413395 4.531526 5.181302 4.446807 5.183251 5.236391
157961 8.666770 7.276140 6.344491 4.547310 5.366400 2.744121 2.958816 3.418584 4.072193 5.160999 ... 4.111676 5.100099 4.003363 5.021746 5.418046 4.534997 5.180692 4.438596 5.179993 5.242696
158145 8.529245 7.285178 6.167300 4.257118 5.425672 2.751566 2.960698 3.558579 4.022826 5.105327 ... 4.109113 5.110326 4.004951 5.038415 5.410239 4.542336 5.171570 4.437569 5.191112 5.251178

1694 rows × 140 columns

Code
fig, ax = plt.subplots(figsize = (10, 8))
sns.scatterplot(x=df_umap.iloc[:,0], y=df_umap.iloc[:,1], color = 'gray', alpha = 0.2, ax = ax)
sns.scatterplot(x=df_umap.loc[a.index].iloc[:,0], y=df_umap.loc[a.index].iloc[:,1],  color = 'red', alpha = 0.5, ax = ax)
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
#ax.set_title('Dimension-reduced Texts')
plt.show()

7. Interpret Results

Code
kmeans = KMeans(n_clusters= 9, random_state=42)
kmeans.fit(mat_pca)
KMeans(n_clusters=9, random_state=42)
Code
df_sample = review_df_all_deduplicate.loc[sample_idx].copy()
df_sample['predicted_label'] = kmeans.labels_

Plot all the numerical variables (except those dominated by missing values) across clusters.

Code
num_col = ['rating', 'total_feedback_count', 'total_neg_feedback_count', 'total_pos_feedback_count']

for num in num_col:
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=df_sample, x='predicted_label', y=num)  # plot the current variable, not just 'rating'
    plt.xlabel('Predicted cluster')
    plt.ylabel(num)
    plt.title(f'{num} by cluster')
    plt.tight_layout()
    plt.show()

Code
print(df_sample.groupby('predicted_label')
      .agg({'rating': 'mean', 'review_text': 'count'})
      .sort_values(by='rating'))
                   rating  review_text
predicted_label                       
1                3.994995          999
3                4.013065          995
7                4.183486         2507
6                4.359102          401
8                4.361702          517
4                4.387047         1297
5                4.399446         2168
2                4.400000          945
0                4.520947          549

The table summarizes the statistical profile of the 9 identified clusters, sorted by average rating. We observe a narrow rating distribution ranging from 3.99 to 4.52, indicating an overall positive sentiment across the sampled dataset. Cluster 7 is the dominant group by volume, while Cluster 1 and Cluster 0 represent the lowest and highest satisfaction extremes, respectively.
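The figures quoted above (the 3.99–4.52 rating spread and Cluster 7's dominance) can be recomputed directly; a minimal sketch using the values copied from the printed table:

```python
import pandas as pd

# summary values transcribed from the cluster table above
summary = pd.DataFrame(
    {'rating': [4.520947, 3.994995, 4.400000, 4.013065, 4.387047,
                4.399446, 4.359102, 4.183486, 4.361702],
     'review_text': [549, 999, 945, 995, 1297, 2168, 401, 2507, 517]},
    index=pd.Index(range(9), name='predicted_label'),
)

# rating spread across clusters
print(round(summary['rating'].min(), 2), round(summary['rating'].max(), 2))  # 3.99 4.52

# cluster with the largest review volume
print(summary['review_text'].idxmax())  # cluster 7
```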

Code
sorted_clusters = sorted(df_sample['predicted_label'].unique())

print("-" * 60)

for cluster_id in sorted_clusters:
    cluster_data = df_sample[df_sample['predicted_label'] == cluster_id]

    # pool the cleaned review text for this cluster and count bigram frequencies
    combined_text = " ".join(cluster_data['review_text_cleaned'].astype(str).tolist())
    ngrams_list = generate_ngrams(combined_text, n_gram=2)
    top_keywords = Counter(ngrams_list).most_common(10)

    print(f"Cluster {cluster_id} (Size: {len(cluster_data)})")
    print(f"Top Keywords: {top_keywords}")
    print("-" * 60)
------------------------------------------------------------
Cluster 0 (Size: 549)
Top Keywords: [('dermalogica sampling', 91), ('gifted dermalogica', 78), ('dead skin', 73), ('sensitive skin', 72), ('daily microfoliant', 59), ('skin feeling', 52), ('complimentary dermalogica', 52), ('leaves skin', 49), ('skin feels', 49), ('dry skin', 44)]
------------------------------------------------------------
Cluster 1 (Size: 999)
Top Keywords: [('full size', 68), ('love product', 51), ('received product', 40), ('received sample', 35), ('sample size', 33), ('skin feel', 31), ('goes long', 29), ('long way', 29), ('highly recommend', 27), ('product free', 27)]
------------------------------------------------------------
Cluster 2 (Size: 945)
Top Keywords: [('makeup remover', 156), ('eye makeup', 129), ('remove makeup', 94), ('removes makeup', 78), ('waterproof mascara', 77), ('cleansing balm', 76), ('leaves skin', 69), ('skin feeling', 62), ('removing makeup', 60), ('love product', 58)]
------------------------------------------------------------
Cluster 3 (Size: 995)
Top Keywords: [('received product', 69), ('leaves skin', 60), ('sensitive skin', 59), ('skin feel', 56), ('long way', 55), ('goes long', 53), ('love product', 48), ('dry skin', 47), ('makes skin', 47), ('skin feeling', 46)]
------------------------------------------------------------
Cluster 4 (Size: 1297)
Top Keywords: [('dry skin', 137), ('sensitive skin', 126), ('leaves skin', 115), ('face wash', 114), ('skin feeling', 102), ('cleansing balm', 100), ('love cleanser', 97), ('skin feels', 91), ('oily skin', 81), ('acne prone', 64)]
------------------------------------------------------------
Cluster 5 (Size: 2168)
Top Keywords: [('dry skin', 374), ('skin feels', 171), ('sensitive skin', 166), ('skin feel', 151), ('oily skin', 143), ('makes skin', 136), ('long way', 122), ('goes long', 118), ('received product', 112), ('skin feeling', 109)]
------------------------------------------------------------
Cluster 6 (Size: 401)
Top Keywords: [('dry skin', 48), ('jet lag', 47), ('summer fridays', 42), ('love mask', 42), ('lag mask', 39), ('sensitive skin', 36), ('overnight mask', 27), ('face mask', 23), ('makes skin', 19), ('next morning', 18)]
------------------------------------------------------------
Cluster 7 (Size: 2507)
Top Keywords: [('sensitive skin', 269), ('acne prone', 162), ('love product', 160), ('using product', 152), ('prone skin', 132), ('dry skin', 119), ('difference skin', 117), ('skin tone', 116), ('skin feels', 113), ('made skin', 111)]
------------------------------------------------------------
Cluster 8 (Size: 517)
Top Keywords: [('sun protection', 154), ('protection factor', 136), ('white cast', 66), ('love sunscreen', 39), ('best sunscreen', 32), ('oily skin', 24), ('sunscreen used', 24), ('leave white', 23), ('doesnt leave', 23), ('unseen sunscreen', 23)]
------------------------------------------------------------

The largest portion of the dataset centers on efficacy for specific skin types. Cluster 7 (the largest group) and Cluster 4 focus heavily on problem-solving for “acne prone” and “sensitive skin,” indicating that efficacy is the primary driver of engagement. Meanwhile, Cluster 5 captures the positive experiences of “dry skin” users seeking hydration.

The model isolated distinct product lines. Cluster 8 identifies “sun protection” with a specific negative sentiment regarding “white cast,” highlighting a critical product defect. Cluster 6 is unique as it is dominated by a single viral product (“Summer Fridays Jet Lag Mask”), demonstrating how specific ‘hero products’ can form their own semantic clusters.

Clusters 0, 1, and 3, by contrast, are not tied to a specific product. Cluster 0 is characterized by terms like “gifted” and “dermalogica sampling,” representing incentivized reviews that may introduce positive bias. Conversely, Cluster 1 focuses on “sample size” complaints.
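One actionable follow-up on the Cluster 0 finding is to flag likely incentivized reviews before computing product metrics. A minimal keyword-rule sketch, where the phrase list is an assumption inspired by the Cluster 0/1 bigrams (“gifted,” “complimentary,” “sampling,” “free”) rather than part of the original pipeline:

```python
import pandas as pd

# illustrative incentive phrases; tune against the actual cluster bigrams
INCENTIVE_PHRASES = ('gifted', 'complimentary', 'sampling', 'free')

def flag_incentivized(texts: pd.Series) -> pd.Series:
    """Boolean mask: True where a review mentions any incentive phrase."""
    pattern = '|'.join(INCENTIVE_PHRASES)
    return texts.str.lower().str.contains(pattern, regex=True, na=False)

reviews = pd.Series([
    'Gifted dermalogica sampling, love it',
    'great cleanser for my dry skin',
    'received this product free for review',
])
print(flag_incentivized(reviews).tolist())  # → [True, False, True]
```

A simple rule like this over-matches (e.g. “oil-free”), so in practice the flag would be reviewed before excluding reviews from rating aggregates.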

Visualization of top 50 bigrams for each cluster.

Code
import math
from wordcloud import WordCloud

n_clusters = 9
n_cols = 3
n_rows = math.ceil(n_clusters / n_cols)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 15))
axes = axes.flatten()

sorted_clusters = sorted(df_sample['predicted_label'].unique())


for i, cluster_id in enumerate(sorted_clusters):
    ax = axes[i]
    
    cluster_data = df_sample[df_sample['predicted_label'] == cluster_id]
    
    combined_text = " ".join(cluster_data['review_text_cleaned'].astype(str).tolist())
    
    ngrams_list = generate_ngrams(combined_text, n_gram=2)
    
    ngram_counts = dict(Counter(ngrams_list).most_common(50))
    
    if len(ngram_counts) > 0:
        wc = WordCloud(
            width=800, 
            height=400, 
            background_color='white', 
            colormap='viridis', 
            max_font_size=100
        ).generate_from_frequencies(ngram_counts)
        
        ax.imshow(wc, interpolation="bilinear")
        ax.set_title(f"Cluster {cluster_id} Theme\n(Size: {len(cluster_data)})", fontsize=14)
    else:
        ax.text(0.5, 0.5, "Not enough data", ha='center', va='center')
        
    ax.axis('off') 

for j in range(i + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()

8. Conclusion

To conclude, by moving beyond static star ratings with unsupervised learning, we transformed a sample of unstructured customer feedback into 9 distinct semantic clusters. We experimented with two algorithms, K-Means and DBSCAN, and found that K-Means was the better fit for our case. We then interpreted the clustering results in terms of their business value, grouping the 9 clusters into 3 distinct categories: skin-type efficacy topics, product-specific topics, and non-product-related noise.

Next Step

In this project, the entire clustering pipeline was built on only around 10,000 sampled reviews, so the clustering result may not be representative of the full review corpus. We also identified limitations in the methods used here, such as the computational cost of t-SNE and the fact that neither algorithm's cluster-shape assumptions are ideal for this data; a more robust approach is needed.
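Scaling beyond the 10,000-review subset would start from the same stratified draw described in the introduction. A minimal sketch with pandas, where stratifying on `rating` is an assumption for illustration (the notebook's actual stratification key is defined earlier in the pipeline):

```python
import pandas as pd

# toy review table standing in for the full deduplicated corpus
reviews = pd.DataFrame({
    'rating': [1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
    'review_text': [f'review {i}' for i in range(12)],
})

# draw the same fraction from each rating stratum, preserving proportions
sample = (reviews.groupby('rating', group_keys=False)
                 .sample(frac=0.5, random_state=42))

print(sample['rating'].value_counts().sort_index())
```

Raising `frac` (or switching to a fixed `n` per stratum) trades computation for coverage, which directly addresses the representativeness concern above.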